Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

Jacky Kwok; Xilun Zhang; Mengdi Xu; Yuejiang Liu; Azalia Mirhoseini; Chelsea Finn; Marco Pavone

arXiv:2602.12281 · cs.RO · February 19, 2026

Open Access

TL;DR

This paper shows that test-time verification, guided by test-time scaling laws over rephrased instructions and generated actions, improves vision-language-action alignment in embodied instruction following and can be more effective than scaling policy learning.

Contribution

The authors introduce CoVer, a contrastive verifier for vision-language-action alignment, and CoVer-VLA, a hierarchical verification pipeline. Both build on test-time scaling laws to better align the actions of Vision-Language-Action (VLA) models with the given instructions.

Findings

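01 Jointly scaling the number of rephrased instructions and generated actions increases test-time sample diversity and often recovers correct actions more efficiently than scaling either dimension alone.

02 The verification-based approach outperforms scaling policy pre-training.

03 Verification yields significant gains on the SIMPLER and PolaRiS benchmarks.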

Abstract

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling laws for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional…
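The abstract describes the test-time procedure only at a high level, so the following is a minimal sketch of the general idea (sample actions for several rephrasings of an instruction, then let a verifier pick one), not the authors' CoVer or CoVer-VLA implementation. Every name here (`select_action`, `toy_rephrase`, `toy_policy`, `toy_verifier`) is hypothetical, the toy components are stand-ins for a real rephraser, VLA policy, and learned verifier, and scoring each candidate against the original instruction is an assumption.

```python
import random
from typing import Callable, List, Sequence, Tuple

Action = Tuple[float, ...]


def select_action(
    instruction: str,
    observation: Sequence[float],
    rephrase: Callable[[str, int], List[str]],
    policy: Callable[[str, Sequence[float], random.Random], Action],
    verifier: Callable[[str, Sequence[float], Action], float],
    num_instructions: int = 4,
    num_actions: int = 4,
    seed: int = 0,
) -> Tuple[Action, float]:
    """Best-of-N selection over (rephrased instruction x sampled action)."""
    rng = random.Random(seed)
    # The original instruction plus (num_instructions - 1) rephrasings.
    prompts = [instruction] + rephrase(instruction, num_instructions - 1)
    # Sample num_actions candidate actions for every prompt.
    candidates = [
        policy(prompt, observation, rng)
        for prompt in prompts
        for _ in range(num_actions)
    ]
    # Assumption: candidates are scored against the ORIGINAL instruction.
    scored = [(verifier(instruction, observation, a), a) for a in candidates]
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score


# --- Hypothetical toy components, for illustration only ---

def toy_rephrase(instruction: str, k: int) -> List[str]:
    # Stand-in for a language model that paraphrases the instruction.
    return [f"{instruction} (variant {i + 1})" for i in range(k)]


def toy_policy(prompt: str, observation: Sequence[float],
               rng: random.Random) -> Action:
    # Stand-in for a VLA policy: a noisy action with a prompt-dependent bias.
    bias = (len(prompt) % 3) * 0.1
    return tuple(o + bias + rng.gauss(0.0, 0.5) for o in observation)


def toy_verifier(instruction: str, observation: Sequence[float],
                 action: Action) -> float:
    # Stand-in for a learned verifier: higher is better. Here the "correct"
    # action is simply the observation itself.
    return -sum((a - o) ** 2 for a, o in zip(action, observation))


if __name__ == "__main__":
    obs = (0.2, -0.1)
    for n_instr, n_act in [(1, 1), (1, 16), (4, 4)]:
        _, score = select_action(
            "pick up the spoon", obs, toy_rephrase, toy_policy, toy_verifier,
            num_instructions=n_instr, num_actions=n_act,
        )
        print(f"{n_instr} instructions x {n_act} actions: best score {score:.4f}")
```

With a fixed seed, the single candidate from the 1 x 1 setting is also the first candidate in the larger settings, so the best verifier score can only stay the same or improve as the sample budget grows. Whether the selected action is actually correct depends on how well the verifier tracks task success, which this toy does not model.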

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques