DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

Muxi Diao; Lele Yang; Hongbo Yin; Zhexu Wang; Yejie Wang; Daxin Tian; Kongming Liang; Zhanyu Ma

arXiv:2505.20665·cs.CV·January 14, 2026

Open Access · 3 Reviews

TL;DR

DriveRX is a novel vision-language reasoning model for autonomous driving that supports multi-stage decision-making, outperforming existing models in complex scenarios by generating structured reasoning chains.

Contribution

The paper introduces AutoDriveRL, a unified framework for structured reasoning in autonomous driving, and presents DriveRX, a cross-task vision-language model that enhances decision-making and robustness.

Findings

01. DriveRX outperforms GPT-4o in behavior reasoning.
02. DriveRX demonstrates robustness under complex or corrupted conditions.
03. Structured reasoning chains improve decision consistency.

Abstract

Effective autonomous driving hinges on robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. While recent vision-language models (VLMs) have been applied to driving tasks, they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language QA problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for multi-stage decision-making. DriveRX achieves…
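The idea of routing each task's answer to its own reward model can be pictured as a simple dispatch table. The sketch below is illustrative only: the four task names come from the abstract, but `exact_match_reward`, `token_overlap_reward`, `REWARD_MODELS`, `score_answer`, and the pairing of tasks with scoring rules are hypothetical placeholders, not the paper's reward models.

```python
from typing import Callable, Dict

# Task names taken from the abstract; everything else here is an assumption.
TASKS = ("perception", "prediction", "planning", "behavior")


def exact_match_reward(answer: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 on a case-insensitive exact match."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def token_overlap_reward(answer: str, reference: str) -> float:
    """Hypothetical soft reward: fraction of reference tokens found in the answer."""
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    answer_tokens = set(answer.lower().split())
    return sum(tok in answer_tokens for tok in ref_tokens) / len(ref_tokens)


# Assumed pairing of tasks and scoring rules, chosen only for illustration.
REWARD_MODELS: Dict[str, Callable[[str, str], float]] = {
    "perception": exact_match_reward,
    "prediction": exact_match_reward,
    "planning": token_overlap_reward,
    "behavior": token_overlap_reward,
}


def score_answer(task: str, answer: str, reference: str) -> float:
    """Route one QA answer to the reward function registered for its task."""
    if task not in REWARD_MODELS:
        raise ValueError(f"unknown task: {task!r}")
    return REWARD_MODELS[task](answer, reference)
```

Keeping the scoring rule behind a per-task lookup means each reasoning stage can receive its own signal without changing the training loop that consumes the scalar reward.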

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The formulation of each sub-task as a vision-language QA problem is interesting.
2. The experimental results are promising.

Weaknesses

1. The introduction section feels somewhat wordy and difficult to follow. It’s not clearly structured around three key points: what common issues exist in current works, what direction and method are proposed, and how the proposed method is designed. These aspects should be laid out more clearly.
2. Additionally, the paper categorizes the autonomous driving pipeline into perception, prediction, planning, and behavior. Typically, autonomous driving is divided into three main tasks, perception, p

Reviewer 02 · Rating 6 · Confidence 4

Strengths

This study introduces a reinforcement-learning–driven, multi-task vision-language reasoning framework designed to enhance the coordination between perception, reasoning, and action in autonomous driving. By attaching task-specific reward models to each stage of the reasoning process, the approach provides finer-grained supervision, improving interpretability and stability. It organizes the problem into four subtasks that reflect the cognitive pipeline of autonomous driving—from perception to be

Weaknesses

1. Ambiguous Reinforcement Learning Details: Equation (2) defines GRPO, but the intra-group advantage term (Eq. 3) is shared across tokens rather than time-dependent; this simplifies training but may degrade credit assignment. The paper does not mention the reward normalization schedule, policy rollout horizon, or sampling temperature, which are essential for RL reproducibility. The rule-based and LLM-based reward models are described qualitatively but lack explicit scoring functions, thresholds, or examples (Appendix M
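To make the credit-assignment concern concrete, here is a minimal sketch of the commonly used GRPO group-relative advantage, in which one scalar per sampled response is copied to every token of that response. It may not match the paper's Eq. 3 exactly: the function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each response's reward by the mean and std of its sampling group.

    Assumption: population standard deviation; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def broadcast_to_tokens(advantages: List[float], token_counts: List[int]) -> List[List[float]]:
    """Copy each response-level advantage to all of that response's tokens.

    Every token in a response gets the same value, so no token is credited
    more than another; this is the sharing the reviewer points to.
    """
    return [[adv] * n for adv, n in zip(advantages, token_counts)]


# Example: four sampled answers to one question, two rewarded and two not.
group_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(group_rewards)
per_token = broadcast_to_tokens(advantages, token_counts=[3, 5, 2, 4])
```

Because the advantage is constant within a response, a long reasoning chain with one wrong step is rewarded or penalized as a whole, which is why the normalization and sampling settings the reviewer asks about matter for reproducing results.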

Reviewer 03 · Rating 6 · Confidence 5

Strengths

+ Proposes a unified RL framework (AutoDriveRL) enabling interpretable, multi-stage reasoning across driving subtasks.
+ Demonstrates state-of-the-art performance and robustness, surpassing larger models under both clean and corrupted conditions.
+ Extends practical impact by showing reasoning-enhanced transferability to trajectory and control tasks.

Weaknesses

**For the image input**: In the structured reasoning process you designed, including prediction and planning, you only feed the model multi-view images, which makes it very hard to ensure the model can capture temporal information like speed; this doesn’t really make sense. Although some other VLAs also only use images, they at least provide historical ego-states to ensure the ego car’s temporal information.

**Application**: The authors design a very complex reasoning process; I am quite puzzl

Taxonomy

Topics: Robotic Path Planning Algorithms · Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications