Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation
Jin Shi, Brady Zhang, Yishun Lu

TL;DR
This paper presents VLA-AD, a framework that uses vision-language models as offline semantic supervisors to distill large vision-language-action policies into lightweight, efficient models suitable for real-time robotic control.
Contribution
The authors introduce a novel semantic distillation method that enhances policy efficiency and robustness without requiring the teacher or VLM during inference.
Findings
The student policy is 44 times smaller than the teacher while maintaining performance.
The method achieves a 3.28 times inference speedup over the teacher.
Semantic guidance improves robustness to noisy teacher actions.
Abstract
Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a…
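Illustrative sketch (not from the paper): one way a training objective of the kind the abstract describes could be assembled, combining imitation of the teacher's 7-DoF actions with auxiliary semantic terms that are dropped at test time. The abstract does not specify the loss, so everything here is an assumption: the mean-squared-error action term, the encoding of phase anchors and operating directions as integer class labels, the auxiliary classification heads, the weights w_phase and w_direction, and all function names are hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between row-wise logits and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def distillation_loss(student_actions, teacher_actions,
                      phase_logits, phase_labels,
                      direction_logits, direction_labels,
                      w_phase=0.1, w_direction=0.1):
    """Hypothetical combined objective: action imitation plus semantic auxiliaries.

    The weights and the choice of MSE / cross-entropy are assumptions made for
    illustration; the paper's actual formulation may differ.
    """
    # Low-level term: match the teacher-provided 7-DoF action targets.
    action_loss = float(np.mean((student_actions - teacher_actions) ** 2))
    # High-level terms: labels assumed to be produced offline by the VLM.
    phase_loss = softmax_cross_entropy(phase_logits, phase_labels)
    direction_loss = softmax_cross_entropy(direction_logits, direction_labels)
    total = action_loss + w_phase * phase_loss + w_direction * direction_loss
    parts = {"action": action_loss, "phase": phase_loss, "direction": direction_loss}
    return total, parts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = 8
    total, parts = distillation_loss(
        student_actions=rng.normal(size=(batch, 7)),
        teacher_actions=rng.normal(size=(batch, 7)),
        phase_logits=rng.normal(size=(batch, 4)),       # 4 phases: assumed
        phase_labels=rng.integers(0, 4, size=batch),
        direction_logits=rng.normal(size=(batch, 6)),   # 6 directions: assumed
        direction_labels=rng.integers(0, 6, size=batch),
    )
    print(total, parts)
```

Under this sketch, only the action output would be used at inference; the auxiliary heads and their labels exist solely to shape training, which is consistent with the abstract's statement that neither the teacher nor the VLM is needed at test time.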
