GLaD: Geometric Latent Distillation for Vision-Language-Action Models
Minghao Guo, Meng Cao, Jiachen Tao, Rongtao Xu, Yan Yan, Xiaodan Liang, Ivan Laptev, Xiaojun Chang

TL;DR
GLaD introduces a geometry-aware framework for vision-language-action models that incorporates 3D geometric priors via knowledge distillation, significantly improving spatial reasoning and task success rates.
Contribution
The paper presents GLaD, a novel geometry-aware VLA framework that integrates 3D geometric priors into multimodal representations through knowledge distillation during pretraining.
Findings
GLaD achieves a 94.1% average success rate across four LIBERO task suites, outperforming UniVLA (92.5%), which uses identical pretraining data.
Geometry-aware pretraining enhances spatial reasoning without explicit depth sensors.
Deep integration of geometric understanding improves policy generalization.
Abstract
Most existing Vision-Language-Action (VLA) models rely primarily on RGB information while ignoring geometric cues crucial for spatial reasoning and manipulation. In this work, we introduce GLaD, a geometry-aware VLA framework that incorporates 3D geometric priors during pretraining through knowledge distillation. Rather than distilling geometric features solely into the vision encoder, we align the LLM's hidden states corresponding to visual tokens with features from a frozen geometry-aware vision transformer (VGGT), ensuring that geometric understanding is deeply integrated into the multimodal representations that drive action prediction. Pretrained on the Bridge dataset with this geometry distillation mechanism, GLaD achieves a 94.1% average success rate across four LIBERO task suites, outperforming UniVLA (92.5%), which uses identical pretraining data. These results validate that…
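The alignment step described in the abstract can be illustrated with a small sketch. The abstract does not state the loss function, the projection, or how visual tokens are matched to teacher features, so everything below is an assumption for illustration only: a hypothetical linear projection `weight`, a boolean `visual_mask` marking which LLM positions hold visual tokens, a one-to-one pairing between those positions and the frozen teacher's features, and a cosine-distance objective. It is not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def project(hidden, weight):
    """Map an LLM hidden state into the teacher's feature space.

    `weight` is a hypothetical projection matrix given as a list of rows,
    with shape (teacher_dim, llm_dim).
    """
    return [sum(w * x for w, x in zip(row, hidden)) for row in weight]


def geometry_distillation_loss(hidden_states, visual_mask, teacher_features, weight):
    """Mean cosine distance between projected visual-token states and teacher features.

    hidden_states:    one vector per LLM token position (student)
    visual_mask:      True where the position holds a visual token (assumed input)
    teacher_features: one vector per visual token from the frozen teacher
    """
    # Keep only the hidden states that correspond to visual tokens.
    visual_states = [h for h, is_visual in zip(hidden_states, visual_mask) if is_visual]
    if len(visual_states) != len(teacher_features):
        raise ValueError("expected one teacher feature per visual token")

    # The teacher is frozen: its features are treated as fixed targets.
    distances = [
        1.0 - cosine_similarity(project(h, weight), t)
        for h, t in zip(visual_states, teacher_features)
    ]
    return sum(distances) / len(distances)
```

With this formulation the loss approaches 0 when the projected student states point in the same direction as the teacher features and approaches 2 when they point in opposite directions. In an actual training setup this term would presumably be added to the action-prediction objective, with gradients flowing only into the student.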
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning
