Foundation Model Robotics

Hi, I'm Kevin Ma.

Foundation Model Robotics — Research & Development

I research foundation models for embodied agents — studying how vision-language-action policies perceive, generalize, and act reliably in the physical world.

Currently: Summer 2026, Research Intern at Corvinus Labs, developing multi-modal VLA models and robotic interfaces.

About

My work sits at the intersection of foundation models and embodied intelligence. I study how vision-language-action (VLA) models can be trained, evaluated, and made reliable enough to control real robots. My particular focus is understanding where and why these policies break down before they are trusted in the physical world.

I work across the full pipeline: generating training data in simulation with NVIDIA Isaac Lab, benchmarking VLA policies under realistic distribution shift, and using quantization and inference optimization to make large vision-language models efficient enough to deploy. I care as much about rigorous, reproducible evaluation as I do about headline results.

Foundation Models · Vision-Language-Action · NVIDIA Isaac Lab · Robot Learning · Model Quantization · TensorRT-LLM · Sim-to-Real · LeRobot

Experience

Summer 2026

Research Intern — VLA & Robotics

Corvinus Labs

Working directly alongside the core engineering team to develop multi-modal vision-language-action (VLA) models, design and test robotic interfaces, and contribute to physical hardware prototyping. The work is hands-on, AI-driven automation within a real-world wet-lab development cycle.

Research

Selected Work

Three threads of my foundation-model robotics work — from generating data in simulation, to stress-testing policies, to making large models efficient to run.

Pipeline: Oracle (state machine + PPO teacher) → Multimodal capture (RGB-D, wrist, forces) → HDF5 to LeRobot v3.0 conversion → ACT policy + closed-loop eval

Isaac Lab

sim2act — VLA Data Flywheel

A full sim-to-action pipeline in NVIDIA Isaac Lab. Deterministic Warp state machines and a PPO teacher (4096 parallel envs) generate Franka pick, barrier, and push demonstrations. Multimodal streams (overhead RGB-D, wrist RGB, joint kinematics, contact forces) are recorded to HDF5 and converted to LeRobot v3.0. The resulting datasets are used to train ACT policies, which are then evaluated closed-loop with camera ablations and out-of-distribution tests.

Barrier 90% (>75% oracle) · diagnosed a 0% push policy's camera shortcut

Isaac Lab · Franka · PPO Teacher · LeRobot v3.0 · ACT

Code →

Results grid comparing VLA manipulation policies under spatial distribution shift

VLA

Manipulation Robustness Under Distribution Shift

How well do VLA policies hold up when the robot starts somewhere new? I benchmark a monolithic SmolVLA policy against a hierarchical Gemini + IK system on tabletop pick-and-place in MuJoCo, across controlled spatial shifts in the initial pose.

Degrades beyond ~10 cm offset · 4 shift conditions

SmolVLA · Gemini · MuJoCo · LeRobot

Code →
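
How much a per-condition success rate can be trusted depends on how many rollouts back it. As a minimal, standard-library sketch of one way to report that uncertainty, the snippet below attaches a Wilson score interval to each condition's success rate. The condition names and rollout counts are hypothetical placeholders for illustration only; they are not results from this benchmark, and the helper names are assumptions rather than code from the project.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical (successes, trials) per shift condition -- illustration only.
example_counts = {
    "cond_a": (18, 20),
    "cond_b": (16, 20),
    "cond_c": (11, 20),
    "cond_d": (4, 20),
}


def summarize(counts: dict[str, tuple[int, int]]) -> dict[str, tuple[float, float, float]]:
    """Map each condition to (success_rate, interval_low, interval_high)."""
    rows = {}
    for condition, (successes, trials) in counts.items():
        low, high = wilson_interval(successes, trials)
        rows[condition] = (successes / trials, low, high)
    return rows


if __name__ == "__main__":
    for condition, (rate, low, high) in summarize(example_counts).items():
        print(f"{condition}: {rate:.0%} success, 95% CI [{low:.2f}, {high:.2f}]")
```

The Wilson interval is used here instead of the simpler normal approximation because it stays inside [0, 1] and behaves sensibly at small rollout counts and near 0% or 100% success.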

Quantization

Quantizing Qwen2-VL on TensorRT-LLM

A failure-mode study of LLM quantization applied to a vision-language model. Standard FP8 saturates the visual-token KV cache and collapses accuracy; SmoothQuant (W8A8) is Pareto-optimal among the configurations evaluated, running roughly 2× faster while retaining accuracy.

≈2× speedup · 7 configs evaluated

Qwen2-VL · TensorRT-LLM · SmoothQuant · FP8

Code → Report →
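
For readers unfamiliar with SmoothQuant, the core idea can be sketched on a single linear layer: a per-channel scale moves activation outliers into the weights so that both tensors quantize well to INT8, while the layer's output stays mathematically unchanged. The NumPy snippet below is an illustrative toy under stated assumptions (synthetic data, one artificial outlier channel, per-tensor symmetric quantization, α = 0.5, hypothetical function names). It is not TensorRT-LLM's implementation and not the configuration used in this study.

```python
import numpy as np


def quantize_int8(t: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 fake quantization (quantize, then dequantize)."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale


def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Shift activation outliers into the weights, one input channel at a time.

    x has shape (tokens, in_features); w has shape (in_features, out_features).
    Returns (x / s, s[:, None] * w), whose product equals x @ w.
    """
    eps = 1e-8  # guards against all-zero channels
    act_max = np.maximum(np.abs(x).max(axis=0), eps)  # per input channel
    w_max = np.maximum(np.abs(w).max(axis=1), eps)    # per input channel
    s = act_max**alpha / w_max ** (1.0 - alpha)
    return x / s, w * s[:, None]


def demo(seed: int = 0):
    """Compare W8A8 error with and without smoothing on synthetic data."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, 16))
    x[:, 3] *= 50.0  # synthetic outlier channel (assumption for illustration)
    w = 0.1 * rng.standard_normal((16, 8))
    reference = x @ w

    naive_err = np.abs(quantize_int8(x) @ quantize_int8(w) - reference).mean()

    x_s, w_s = smooth(x, w)
    smooth_err = np.abs(quantize_int8(x_s) @ quantize_int8(w_s) - reference).mean()

    # Smoothing alone (before quantization) must not change the layer output.
    equivalent = np.allclose(x_s @ w_s, reference)
    return naive_err, smooth_err, equivalent


if __name__ == "__main__":
    naive_err, smooth_err, equivalent = demo()
    print(f"smoothing preserves output: {equivalent}")
    print(f"mean abs error, naive W8A8:    {naive_err:.4f}")
    print(f"mean abs error, smoothed W8A8: {smooth_err:.4f}")
```

In this toy setup the outlier channel forces a coarse quantization step on every other activation channel; dividing it out before quantization is what lets the smoothed variant recover accuracy.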