Foundation Model Robotics

Hi, I'm Kevin Ma.

Foundation Model Robotics — Research & Development

I research foundation models for embodied agents — studying how vision-language-action policies perceive, generalize, and act reliably in the physical world.

Currently: Summer 2026, Research Intern at Corvinus Labs, developing multi-modal VLA models and robotic interfaces.

About

My work sits at the intersection of foundation models and embodied intelligence. I study how vision-language-action (VLA) models can be trained, evaluated, and made reliable enough to control real robots — with a particular focus on understanding where and why these policies break down before trusting them in the physical world.

I work across the full pipeline: generating training data in simulation with NVIDIA Isaac Lab, benchmarking VLA policies under realistic distribution shift, and using quantization and inference optimization to make large vision-language models efficient enough to deploy. I care as much about rigorous, reproducible evaluation as I do about headline results.

Foundation Models · Vision-Language-Action · NVIDIA Isaac Lab · Robot Learning · Model Quantization · TensorRT-LLM · Sim-to-Real · LeRobot

Experience

Summer 2026

Research Intern — VLA & Robotics

Working alongside the core engineering team to develop multi-modal vision-language-action (VLA) models, design and test robotic interfaces, and contribute to physical hardware prototyping. The work is hands-on, AI-driven automation within a real-world wet-lab development cycle.

Research

Selected Work

Three threads of my foundation-model robotics work — from generating data in simulation, to stress-testing policies, to making large models efficient to run.

Figure: sim2act pipeline stages. Oracle (state machine + PPO teacher) → multimodal capture (RGB-D, wrist, forces) → HDF5 to LeRobot v3.0 → ACT policy + closed-loop eval.
Isaac Lab

sim2act — VLA Data Flywheel

A full sim-to-action pipeline in NVIDIA Isaac Lab. Deterministic Warp state machines and a PPO teacher (4096 parallel envs) generate Franka pick, barrier, and push demonstrations. Multimodal streams (overhead RGB-D, wrist RGB, joint kinematics, contact forces) are recorded, converted from HDF5 to LeRobot v3.0, and used to train ACT policies. The trained policies are evaluated closed-loop, with camera ablations and out-of-distribution tests.

Barrier 90% (>75% oracle) · diagnosed a 0% push policy's camera shortcut
Isaac Lab · Franka · PPO Teacher · LeRobot v3.0 · ACT
Figure: results grid comparing VLA manipulation policies under spatial distribution shift.
VLA

Manipulation Robustness Under Distribution Shift

How well do VLA policies hold up when the robot starts somewhere new? I benchmark a monolithic SmolVLA policy against a hierarchical Gemini + IK system on tabletop pick-and-place in MuJoCo, across controlled spatial shifts in the initial pose.
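To illustrate the kind of protocol described above, here is a minimal sketch of a per-condition success-rate harness. It is not the project's code. The condition names and offset radii, the `ShiftCondition` type, and `toy_rollout` (a stand-in for a real simulator episode with an arbitrary 12 cm failure threshold) are all hypothetical. Only the idea of sampling initial-pose offsets per shift condition and reporting success rates comes from the text.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Offset = Tuple[float, float]


@dataclass(frozen=True)
class ShiftCondition:
    name: str
    max_offset_m: float  # radius of the initial-pose offset disc, in metres


# Hypothetical conditions; the real benchmark's values are not given in the text.
CONDITIONS: List[ShiftCondition] = [
    ShiftCondition("nominal", 0.00),
    ShiftCondition("5cm", 0.05),
    ShiftCondition("10cm", 0.10),
    ShiftCondition("20cm", 0.20),
]


def sample_offset(cond: ShiftCondition, rng: random.Random) -> Offset:
    """Sample an (x, y) offset uniformly from a disc of radius max_offset_m."""
    r = cond.max_offset_m * math.sqrt(rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(theta), r * math.sin(theta))


def evaluate(
    rollout: Callable[[Offset], bool],
    conditions: List[ShiftCondition],
    episodes: int,
    seed: int = 0,
) -> Dict[str, float]:
    """Return the success rate per condition.

    A fixed seed means every policy evaluated with this function sees the
    same sequence of initial-pose offsets, which keeps comparisons fair.
    """
    rng = random.Random(seed)
    results: Dict[str, float] = {}
    for cond in conditions:
        successes = sum(
            bool(rollout(sample_offset(cond, rng))) for _ in range(episodes)
        )
        results[cond.name] = successes / episodes
    return results


def toy_rollout(offset: Offset) -> bool:
    """Stand-in for a simulator episode: succeeds within 12 cm (arbitrary)."""
    return math.hypot(*offset) <= 0.12


if __name__ == "__main__":
    for name, rate in evaluate(toy_rollout, CONDITIONS, episodes=50).items():
        print(f"{name:>8}: {rate:.0%}")
```

In a real benchmark, `rollout` would reset the simulator with the sampled offset, run the policy closed-loop, and return whether the task succeeded.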

Degrades beyond ~10 cm offset · 4 shift conditions
SmolVLA · Gemini · MuJoCo · LeRobot
Figure: speed-accuracy tradeoff chart for Qwen2-VL quantization configurations.
Quantization

Quantizing Qwen2-VL on TensorRT-LLM

A failure-mode study of LLM quantization applied to a vision-language model. Standard FP8 saturates the visual-token KV cache and collapses accuracy; SmoothQuant (W8A8) is Pareto-optimal — roughly 2× faster while retaining accuracy.
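For readers unfamiliar with SmoothQuant, the sketch below shows its core idea on a toy example in pure Python: per-channel scales move activation outliers into the weights so that both tensors quantize well to INT8. It is an illustration only, not the study's code and not the TensorRT-LLM API. The matrices, the `alpha = 0.5` setting, and the per-tensor fake quantizer are assumptions chosen for the demo. Real SmoothQuant computes the scales offline from calibration data.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def smoothing_scales(x: Matrix, w: Matrix, alpha: float = 0.5) -> List[float]:
    """s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha), per input channel j.

    Assumes every channel has a non-zero activation and weight maximum.
    """
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        wt_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / wt_max ** (1.0 - alpha))
    return scales


def smooth(x: Matrix, w: Matrix, alpha: float = 0.5) -> Tuple[Matrix, Matrix]:
    """Return (X / s, s * W); the product X @ W is mathematically unchanged."""
    s = smoothing_scales(x, w, alpha)
    x_s = [[v / s[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s


def fake_quant_int8(m: Matrix) -> Matrix:
    """Symmetric per-tensor INT8: snap to the 127-level grid, then dequantize."""
    scale = max(abs(v) for row in m for v in row) / 127.0
    return [[round(v / scale) * scale for v in row] for row in m]


def max_abs_error(a: Matrix, b: Matrix) -> float:
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Toy data: activation channel 1 carries outliers, its weights are small.
x = [[0.5, 60.0], [-0.4, -48.0]]
w = [[0.8, -0.6], [0.02, 0.015]]

reference = matmul(x, w)

# Naive W8A8: quantize activations and weights as they are.
naive_err = max_abs_error(matmul(fake_quant_int8(x), fake_quant_int8(w)), reference)

# SmoothQuant-style W8A8: rescale first, then quantize.
x_s, w_s = smooth(x, w)
smooth_err = max_abs_error(
    matmul(fake_quant_int8(x_s), fake_quant_int8(w_s)), reference
)

if __name__ == "__main__":
    print(f"naive W8A8 max error:    {naive_err:.4f}")
    print(f"smoothed W8A8 max error: {smooth_err:.4f}")
```

On this toy input the smoothed variant has a much smaller quantization error, because the outlier channel no longer dictates the activation grid.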

≈2× speedup · 7 configs evaluated
Qwen2-VL · TensorRT-LLM · SmoothQuant · FP8