Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving

Lijin Yang; Jianing Huang; Zhongzhan Huang; Shu Liu; Hao Yang

arXiv:2604.27366·cs.CV·May 1, 2026

TL;DR

This paper introduces CriticVLA, a two-stage vision language action (VLA) framework for autonomous driving in which a VLA-based critic evaluates and refines a rough initial trajectory, reaching a 73.33% success rate on Bench2Drive.

Contribution

The paper proposes a critic-centric VLA framework and constructs a synthetic dataset of 12.9 million annotated trajectories to support critic-guided refinement of driving decisions.

Findings

01. CriticVLA achieves a 73.33% success rate on Bench2Drive.

02. It delivers an improvement of about 30% in challenging scenarios.

03. The framework outperforms state-of-the-art baselines.

Abstract

Recent advances in vision language action (VLA) models have shown remarkable potential for autonomous driving by directly mapping multimodal inputs to control signals. However, previous VLA-based methods have not explicitly exploited the critic capability of VLAs to refine driving decisions, even though such capability has been well demonstrated in other LLM-based domains, thereby limiting their performance in complex closed-loop scenarios. In this work, we present a theoretically inspired two-stage framework, CriticVLA, which extends the role of VLAs from acting to judging. CriticVLA first generates a rough trajectory and then refines it through multimodal evaluation and single-step optimization guided by a VLA-based critic, yielding higher-quality driving behaviors. To support this process, we construct a large-scale synthetic dataset of 12.9 million annotated trajectories covering…
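
The generate-then-refine loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the paper's critic is a VLA model, whereas `toy_critic` is a hand-written obstacle penalty, and `refine_once` uses a finite-difference gradient step as a stand-in for the single-step optimization. None of these names come from the paper.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]
Critic = Callable[[Trajectory], float]


def toy_critic(traj: Trajectory,
               obstacle: Point = (2.0, 0.0),
               radius: float = 1.0) -> float:
    """Hypothetical critic: higher is better.

    Penalizes every waypoint that falls inside a circular obstacle.
    A real critic would score the trajectory from multimodal inputs.
    """
    ox, oy = obstacle
    score = 0.0
    for x, y in traj:
        dist_sq = (x - ox) ** 2 + (y - oy) ** 2
        score -= max(0.0, radius ** 2 - dist_sq)
    return score


def numerical_gradient(critic: Critic, traj: Trajectory,
                       eps: float = 1e-4) -> List[Point]:
    """Central-difference gradient of the critic score per waypoint."""
    grad: List[Point] = []
    for i, (x, y) in enumerate(traj):
        def score_at(px: float, py: float) -> float:
            shifted = list(traj)
            shifted[i] = (px, py)
            return critic(shifted)

        gx = (score_at(x + eps, y) - score_at(x - eps, y)) / (2 * eps)
        gy = (score_at(x, y + eps) - score_at(x, y - eps)) / (2 * eps)
        grad.append((gx, gy))
    return grad


def refine_once(critic: Critic, traj: Trajectory,
                step: float = 0.5) -> Trajectory:
    """Stage 2 (assumed form): one gradient-ascent step on the critic score."""
    grad = numerical_gradient(critic, traj)
    return [(x + step * gx, y + step * gy)
            for (x, y), (gx, gy) in zip(traj, grad)]


# Stage 1 (assumed form): a rough straight-line trajectory that clips the obstacle.
rough = [(float(i), 0.1) for i in range(5)]
refined = refine_once(toy_critic, rough)
print(toy_critic(rough), toy_critic(refined))
```

In this toy setup only the waypoint inside the obstacle moves, and the critic score of the refined trajectory is higher than that of the rough one. How CriticVLA actually parameterizes the critic and performs the update is not specified in the text above.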
