DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

Shuyao Shang; Bing Zhan; Yunfei Yan; Yuqi Wang; Yingyan Li; Yasong An; Xiaoman Wang; Jierui Liu; Lu Hou; Lue Fan; Zhaoxiang Zhang; Tieniu Tan

arXiv:2603.11041·cs.CV·March 16, 2026

TL;DR

DynVLA introduces Dynamics CoT, a chain-of-thought paradigm for autonomous driving in which the model forecasts compact world dynamics before generating actions. The authors report that it outperforms existing Textual CoT and Visual CoT methods across various datasets.

Contribution

The paper presents DynVLA, a driving VLA model that uses a Dynamics Tokenizer to compress future evolution into a small set of dynamics tokens, and that decouples ego-centric from environment-centric dynamics for more accurate world dynamics modeling.

Findings

01

DynVLA outperforms both Textual CoT and Visual CoT in the reported experiments.

02

Decoupling ego-centric and environment-centric dynamics yields more accurate world dynamics modeling.

03

Dynamics Tokenizer enables compact and interpretable world dynamics representation.

Abstract

We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image…
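The ordering the abstract describes, dynamics tokens first and actions second, can be illustrated with a toy sketch. Nothing below comes from the paper's implementation: the nearest-codebook quantizer stands in for the Dynamics Tokenizer purely as an assumption, and every name (`quantize`, `build_target_sequence`, the `<dyn>`, `<ego_k>`, `<env_k>`, and `<act_k>` token strings, the codebook values) is hypothetical. It only shows how separate ego-centric and environment-centric features might each be mapped to a few discrete tokens that precede the action tokens in one training sequence.

```python
from typing import List, Sequence


def quantize(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry nearest to vec (squared L2).

    Assumption: a plain nearest-neighbour lookup stands in for the paper's
    Dynamics Tokenizer, whose actual design is not given in this listing.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def build_target_sequence(
    ego_feats: Sequence[Sequence[float]],
    env_feats: Sequence[Sequence[float]],
    action_tokens: Sequence[str],
    ego_codebook: Sequence[Sequence[float]],
    env_codebook: Sequence[Sequence[float]],
) -> List[str]:
    """Place dynamics tokens before action tokens in a single sequence.

    Ego-centric and environment-centric features use separate (hypothetical)
    codebooks, mirroring the decoupling described in the abstract.
    """
    ego_tokens = [f"<ego_{quantize(f, ego_codebook)}>" for f in ego_feats]
    env_tokens = [f"<env_{quantize(f, env_codebook)}>" for f in env_feats]
    return ["<dyn>"] + ego_tokens + env_tokens + ["</dyn>"] + list(action_tokens)


# Made-up values for illustration only.
ego_codebook = [[0.0, 0.0], [1.0, 0.0]]
env_codebook = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

sequence = build_target_sequence(
    ego_feats=[[0.9, 0.1]],
    env_feats=[[0.1, 0.8], [-0.7, 0.2]],
    action_tokens=["<act_12>", "<act_7>"],
    ego_codebook=ego_codebook,
    env_codebook=env_codebook,
)
print(sequence)
```

In this sketch the reasoning prefix is three tokens long regardless of image resolution, which is the kind of compactness the abstract contrasts with dense image prediction in Visual CoT. How DynVLA actually learns its tokens and how many it uses are not stated in this listing.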

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition