Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

Chen Zhao; Zhuoran Wang; Haoyang Li; Shifeng Bao; Guanlin Li; Youhe Feng; Yang Li; Jie Tang; Jing Zhang

arXiv:2603.18091·cs.CV·March 20, 2026

TL;DR

This paper introduces Action-Draft-and-Verify (ADV), a framework combining diffusion and auto-regressive methods for vision-language-action models, improving robustness and success rates in embodied tasks.

Contribution

The paper proposes a novel self-verifying framework that drafts multiple action candidates and selects the best using a VLM, enhancing performance over diffusion-only baselines.

Findings

01. +4.3 points success rate in simulation
02. +19.7 points success rate in real-world
03. Single-pass VLM reranking with improved robustness

Abstract

Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, with a…
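The draft-then-verify step described in the abstract can be illustrated with a minimal sketch. It assumes each drafted action chunk has already been tokenized and scored by the VLM, so that a `(K, T)` array of per-token log-probabilities is available for `K` candidates. The function names `perplexity_scores` and `select_candidate` are hypothetical, and plain perplexity stands in here for the paper's "perplexity-style metric", whose exact form is not given in this excerpt.

```python
import numpy as np


def perplexity_scores(token_logprobs, mask=None):
    """Return one perplexity per candidate.

    token_logprobs: shape (K, T), log p(token_t | context, earlier tokens)
        for each of K candidate action chunks (assumed to come from a
        single batched VLM forward pass).
    mask: optional (K, T) array of 0/1 flags marking valid (non-padding) tokens.
    """
    logp = np.asarray(token_logprobs, dtype=float)
    if mask is None:
        mask = np.ones_like(logp)
    mask = np.asarray(mask, dtype=float)
    # Mean negative log-likelihood over valid tokens, then exponentiate.
    nll = -(logp * mask).sum(axis=1) / mask.sum(axis=1)
    return np.exp(nll)


def select_candidate(candidates, token_logprobs, mask=None):
    """Pick the candidate the verifier finds most likely (lowest perplexity)."""
    scores = perplexity_scores(token_logprobs, mask)
    best = int(np.argmin(scores))
    return best, candidates[best], scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical drafts: 3 candidate chunks, 4 timesteps, 2 action dims.
    drafts = rng.normal(size=(3, 4, 2))
    # Hypothetical verifier log-probabilities for each candidate's tokens.
    logprobs = [[-0.1, -0.2, -0.3],
                [-1.0, -1.0, -1.0],
                [-0.5, -0.5, -0.5]]
    best, chunk, scores = select_candidate(drafts, logprobs)
    print("selected candidate:", best)
    print("perplexities:", np.round(scores, 3))
```

Scoring all candidates as one batch mirrors the single-forward-pass claim in the abstract: the cost of verification grows with batch size rather than with the number of sequential decoding steps.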

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis