TL;DR
CorridorVLA predicts explicit spatial anchors and uses them as physical cues to guide generative action models in vision-language-action tasks, improving success rates by 3.4% to 12.4% on the LIBERO-Plus benchmark.
Contribution
It proposes using sparse physical anchors to define an explicit tolerance region (a corridor) that guides action generation during training, improving both interpretability and performance.
Findings
Improves success rate by 3.4% to 12.4% over the SmolVLA and GR00T baselines on the LIBERO-Plus benchmark.
The GR00T-Corr variant reaches an 83.21% success rate.
The anchors provide interpretable physical constraints for generative policies.
Abstract
Vision-Language-Action (VLA) models often use intermediate representations to connect multimodal inputs with continuous control, yet spatial guidance is often injected implicitly through latent features. We propose CorridorVLA, which predicts sparse spatial anchors as incremental physical changes (e.g., Δ-positions) and uses them to impose an explicit tolerance region in the training objective for action generation. The anchors define a corridor that guides a flow-matching action head: trajectories whose implied spatial evolution falls outside it receive corrective gradients, while minor deviations from contacts and execution noise are permitted. On the more challenging LIBERO-Plus benchmark, CorridorVLA yields consistent gains across both SmolVLA and GR00T, improving success rate by 3.4% to 12.4% over the corresponding baselines; notably, our GR00T-Corr variant reaches a…
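The corridor idea in the abstract can be illustrated with a dead-zone penalty: deviations from an anchor that stay within a tolerance cost nothing, and only the excess beyond it is penalized. The sketch below is an illustration under assumptions, not the paper's actual objective. The function name `corridor_penalty`, the per-step displacement action format, the representation of anchors as cumulative displacements from the start pose, and the squared-excess form are all hypothetical choices made here.

```python
import math


def corridor_penalty(deltas, anchors, tolerance):
    """Dead-zone penalty on the positions implied by per-step displacements.

    deltas:    list of (dx, dy, dz) displacements, one per step (assumed action format)
    anchors:   dict mapping a 1-based step index to an (x, y, z) anchor, expressed
               as cumulative displacement from the start pose (assumption)
    tolerance: corridor half-width; deviations within it are not penalized
    """
    position = [0.0, 0.0, 0.0]
    total = 0.0
    for step, delta in enumerate(deltas, start=1):
        # Integrate the displacement to get the implied position at this step.
        position = [p + d for p, d in zip(position, delta)]
        if step in anchors:
            distance = math.dist(position, anchors[step])
            # Only the part of the deviation outside the corridor contributes.
            excess = max(0.0, distance - tolerance)
            total += excess ** 2
    # Average over anchors; guard against an empty anchor set.
    return total / max(len(anchors), 1)


# A trajectory that stays within 0.1 of both anchors incurs no penalty.
inside = corridor_penalty(
    [(0.1, 0.0, 0.0)] * 4,
    {2: (0.2, 0.0, 0.0), 4: (0.4, 0.05, 0.0)},
    tolerance=0.1,
)

# A trajectory that ends 0.3 away from its anchor is penalized on the excess only.
outside = corridor_penalty([(0.0, 0.3, 0.0)], {1: (0.0, 0.0, 0.0)}, tolerance=0.1)
```

In a training setup, a term of this shape would be added to the action head's main loss so that its gradient is zero inside the corridor and non-zero outside it, which matches the behavior the abstract describes.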
