IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model

Anqing Jiang; Yu Gao; Yiru Wang; Zhigang Sun; Shuo Wang; Yuwen Heng; Hao Sun; Shichen Tang; Lijuan Zhu; Jinhao Chai; Jijun Wang; Zichong Gu; Hao Jiang; Li Sun

arXiv:2508.06571·cs.AI·August 18, 2025

TL;DR

This paper introduces IRL-VLA, a novel framework for training vision-language-action models in autonomous driving using a reward world model and reinforcement learning, addressing limitations of previous open-loop and simulation-dependent methods.

Contribution

The paper presents a three-stage training paradigm combining imitation learning, inverse reinforcement learning, and reinforcement learning with a reward world model for VLA in autonomous driving.
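As a rough illustration of how the three stages fit together, here is a toy sketch in plain Python. It is not the authors' implementation: the discrete candidate maneuvers in ACTIONS, the demonstration list, the scored samples, and all function names are hypothetical, and the "reward model" is a simple per-action average standing in for the paper's reward world model, whose training signal is not described in this summary.

```python
import math

# Hypothetical candidate maneuvers; the real policy outputs driving trajectories.
ACTIONS = ["keep_lane", "slow_down", "nudge_left"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pretrain_by_imitation(demos, smoothing=1.0):
    """Stage 1 (toy): behavior cloning as log of smoothed demo frequencies."""
    counts = [smoothing] * len(ACTIONS)
    for action in demos:
        counts[action] += 1
    total = sum(counts)
    return [math.log(c / total) for c in counts]


def fit_reward_model(scored_samples):
    """Stage 2 (toy): reward model as the mean observed score per action."""
    sums = [0.0] * len(ACTIONS)
    counts = [0] * len(ACTIONS)
    for action, score in scored_samples:
        sums[action] += score
        counts[action] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def improve_with_rl(logits, reward, steps=200, lr=0.5):
    """Stage 3 (toy): exact policy-gradient ascent on the learned reward."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, reward))
        # Gradient of expected reward w.r.t. logit i is p_i * (r_i - baseline).
        logits = [l + lr * p * (r - baseline)
                  for l, p, r in zip(logits, probs, reward)]
    return logits


def best_action(logits):
    """Name of the action with the highest logit."""
    return ACTIONS[max(range(len(logits)), key=lambda i: logits[i])]


# Made-up data: demonstrations mostly keep the lane, but scores favor slowing down.
demos = [0, 0, 0, 0, 1, 0, 0, 1]
scored_samples = [(0, 0.4), (0, 0.5), (1, 0.9), (1, 0.8), (2, 0.2)]

bc_logits = pretrain_by_imitation(demos)
reward = fit_reward_model(scored_samples)
rl_logits = improve_with_rl(bc_logits, reward)

print("after imitation:", best_action(bc_logits))
print("after RL on learned reward:", best_action(rl_logits))
```

In this toy setting the imitation-only policy copies the most frequent demonstrated action, while the reinforcement learning stage shifts probability toward the action the learned reward scores highest, without querying any simulator during policy improvement.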

Findings

01

Achieved state-of-the-art performance on the NAVSIM v2 benchmark.

02

Placed first runner-up in the CVPR 2025 Autonomous Grand Challenge.

03

Demonstrated effective closed-loop training without high-fidelity simulation.

Abstract

Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to reproduce the behaviors recorded in the dataset, leading to suboptimal and constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework built on an Inverse Reinforcement Learning (IRL) reward world model with a self-built VLA approach. Our framework proceeds in three stages. In the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight…
