VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
Hao Wang, Xiaobao Wei, Jingyang He, Chengyu Bai, Chun-Kai Fan, Jiajun Cao, Jintao Chen, Ying Li, Shanyu Rong, Ming Lu, Xiaozhu Ju, Jian Tang, Shanghang Zhang

TL;DR
VEGA introduces a framework that aligns visual encoder outputs with 3D-aware features to improve spatial reasoning in vision-language-action models for robotic manipulation.
Contribution
VEGA directly aligns visual encoder outputs with 3D-aware features, improving both the interpretability of spatial grounding and task performance without added inference cost.
Findings
VEGA outperforms existing implicit spatial grounding methods.
It achieves state-of-the-art results on manipulation benchmarks.
It adds no computational overhead at inference.
Abstract
Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view…
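To make the idea of encoder-level alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes a cosine-similarity objective, a single linear projector from the VLA visual encoder's token width to the teacher's width, and matching patch grids; none of these details are confirmed by the abstract. The names `alignment_loss`, `proj_weight`, and `proj_bias` are hypothetical, and random arrays stand in for the real encoder tokens and the frozen DINOv2-FiT3D features.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_loss(student_tokens, teacher_tokens, proj_weight, proj_bias):
    """Mean (1 - cosine similarity) between projected student and teacher tokens.

    student_tokens: (num_patches, d_student), visual encoder output (stand-in)
    teacher_tokens: (num_patches, d_teacher), frozen 3D-aware features (stand-in)
    proj_weight:    (d_student, d_teacher), hypothetical linear projector
    proj_bias:      (d_teacher,)
    """
    projected = student_tokens @ proj_weight + proj_bias
    cos = np.sum(l2_normalize(projected) * l2_normalize(teacher_tokens), axis=-1)
    return float(np.mean(1.0 - cos))


# Toy shapes chosen for illustration only.
rng = np.random.default_rng(0)
num_patches, d_student, d_teacher = 16, 32, 24

student = rng.normal(size=(num_patches, d_student))
W = rng.normal(size=(d_student, d_teacher)) * 0.1
b = np.zeros(d_teacher)

# Unrelated teacher features: the loss sits well above zero.
teacher_random = rng.normal(size=(num_patches, d_teacher))
loss_random = alignment_loss(student, teacher_random, W, b)

# Teacher features equal to the projected student: the loss is near zero.
teacher_matched = student @ W + b
loss_aligned = alignment_loss(student, teacher_matched, W, b)
```

Under these assumptions, the projector and the teacher are only needed to compute the training loss, so dropping them afterwards would leave the inference path of the VLA unchanged, which is one plausible reading of the "no added inference cost" claim.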
