Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone

TL;DR
This paper shows that test-time verification, guided by test-time scaling laws, improves vision-language-action alignment in embodied instruction following and can be more effective than scaling policy learning.
Contribution
The authors introduce CoVer, a contrastive verifier, and CoVer-VLA, a hierarchical verification pipeline, which leverage test-time scaling laws to better align generated actions with instructions in vision-language-action models.
Findings
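Jointly scaling rephrased instructions and generated actions increases test-time sample diversity and often recovers correct actions more efficiently than scaling either dimension alone.
The verification-based approach outperforms scaling policy pre-training.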
Significant gains on SIMPLER and PolaRiS benchmarks.
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling laws for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional…
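The joint scaling idea in the abstract can be illustrated with a minimal best-of-N sketch: sample several actions for each rephrased instruction, score every candidate with a verifier, and keep the highest-scoring one. This is not the authors' implementation. The function `verify_and_select`, the toy policy, and the toy scorer are all hypothetical, and scoring each candidate against the original instruction is an assumption made here for illustration; CoVer itself is a learned contrastive verifier whose details are not given in this summary.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[float, ...]


def verify_and_select(
    original_instruction: str,
    rephrasings: Sequence[str],
    sample_actions: Callable[[str, int], List[Action]],
    score: Callable[[str, Action], float],
    actions_per_instruction: int,
) -> Tuple[float, str, Action]:
    """Score every (rephrasing, sampled action) pair and return the best one.

    Returns (score, rephrasing that produced the action, action).
    Assumption: candidates are scored against the original instruction.
    """
    if not rephrasings or actions_per_instruction < 1:
        raise ValueError("need at least one rephrasing and one action sample")
    best = None
    for rephrasing in rephrasings:
        for action in sample_actions(rephrasing, actions_per_instruction):
            s = score(original_instruction, action)
            if best is None or s > best[0]:
                best = (s, rephrasing, action)
    return best


# Toy stand-ins (hypothetical): a deterministic "policy" and a "verifier".
def toy_policy(instruction: str, n: int) -> List[Action]:
    # Proposes n one-dimensional actions derived from the instruction length.
    base = len(instruction)
    return [(float(base + k),) for k in range(n)]


def toy_score(instruction: str, action: Action) -> float:
    # Higher is better; pretends the ideal action value is 12.0.
    return -abs(action[0] - 12.0)


best = verify_and_select(
    "pick up the cup",
    ["pick up the cup", "grasp the mug"],
    toy_policy,
    toy_score,
    actions_per_instruction=3,
)
```

In this toy run the selected action comes from the rephrased instruction rather than the original one, which mirrors the abstract's point that scaling both dimensions widens the pool of candidates the verifier can choose from.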
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
