ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
Nastaran Darabi and Amit Ranjan Trivedi

TL;DR
ProGAL-VLA introduces a grounded alignment approach for vision-language-action models that improves robustness, sensitivity to language instructions, and ambiguity handling in robotic agents through symbolic grounding and an intrinsic ambiguity signal.
Contribution
It presents a grounding and alignment framework that combines a 3D entity-centric graph, a slow planner that produces symbolic sub-goals, and a Grounding Alignment Contrastive (GAC) loss to improve instruction sensitivity and ambiguity awareness.
Findings
Robustness under robot perturbations increased from 30.3% to 71.5%.
Language ignorance was reduced by a factor of 3 to 4.
Entity retrieval recall@1 improved from 0.41 to 0.71.
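For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how Recall@1 is commonly computed. The function name `recall_at_1`, the similarity matrix, and the target indices are illustrative assumptions; the paper's actual evaluation code is not shown in this summary.

```python
def recall_at_1(similarity, targets):
    """Fraction of queries whose top-scoring candidate is the correct one.

    similarity[i][j] is a (hypothetical) score between query i and candidate
    entity j; targets[i] is the index of the correct entity for query i.
    """
    hits = 0
    for scores, target in zip(similarity, targets):
        # Index of the highest-scoring candidate for this query.
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == target)
    return hits / len(targets)


# Toy data, invented for illustration only.
scores = [
    [0.9, 0.1, 0.0],  # top candidate is 0
    [0.2, 0.3, 0.5],  # top candidate is 2
    [0.4, 0.6, 0.0],  # top candidate is 1
]
targets = [0, 2, 0]
print(recall_at_1(scores, targets))
```

In this toy case two of the three queries retrieve the correct entity first, so the metric is two thirds.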
Abstract
Vision-language-action (VLA) models enable generalist robotic agents but often exhibit language ignorance, relying on visual shortcuts and remaining insensitive to instruction changes. We present Prospective Grounding and Alignment VLA (ProGAL-VLA), which constructs a 3D entity-centric graph (GSM), uses a slow planner to produce symbolic sub-goals, and aligns them with grounded entities via a Grounding Alignment Contrastive (GAC) loss. All actions are conditioned on a verified goal embedding, whose attention entropy provides an intrinsic ambiguity signal. On LIBERO-Plus, ProGAL-VLA increases robustness under robot perturbations from 30.3 to 71.5 percent, reduces language ignorance by 3x-4x, and improves entity retrieval from 0.41 to 0.71 Recall@1. On the Custom Ambiguity Benchmark, it reaches AUROC 0.81 (vs. 0.52), AUPR 0.79, and raises clarification on ambiguous inputs from 0.09…
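The two mechanisms named in the abstract, a contrastive alignment loss and attention entropy as an ambiguity signal, can be illustrated with a minimal sketch. This is not the paper's implementation: the exact form of the GAC loss is not given in this summary, so a generic InfoNCE-style loss stands in for it, and the function names, temperature value, and toy scores are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of an attention distribution over entities.

    High entropy means attention is spread across several candidate entities,
    which can be read as a sign that the instruction is ambiguous.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def contrastive_loss(sim_row, positive_index, temperature=0.1):
    """InfoNCE-style loss for one sub-goal (a stand-in, not the paper's GAC loss).

    sim_row holds similarities between the sub-goal embedding and each entity
    embedding; positive_index marks the entity the sub-goal should ground to.
    """
    probs = softmax([s / temperature for s in sim_row])
    return -math.log(probs[positive_index])


# Toy attention logits over three entities (invented for illustration).
peaked = softmax([8.0, 0.0, 0.0])  # one clear referent
flat = softmax([1.0, 1.0, 1.0])    # three equally plausible referents

print(attention_entropy(peaked), attention_entropy(flat))
print(contrastive_loss([0.9, 0.1, 0.1], positive_index=0))
```

Under these assumptions, the flat distribution yields the maximum entropy for three entities (the natural log of 3), while the peaked one yields a value near zero. An agent could compare the entropy against a tuned threshold to decide when to ask for clarification, though the decision rule used in the paper is not described here.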
