DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving
Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan

TL;DR
DynVLA introduces Dynamics CoT, a chain-of-thought paradigm for autonomous driving in which the model forecasts compact world dynamics before generating actions. It is reported to outperform existing Textual CoT and Visual CoT methods across the datasets evaluated.
Contribution
The paper presents DynVLA, a driving VLA model that uses a Dynamics Tokenizer to predict compact world dynamics and decouples ego-centric from environment-centric dynamics, yielding more accurate world dynamics modeling.
Findings
In the reported experiments, DynVLA outperforms both Textual CoT and Visual CoT.
Decoupling ego-centric and environment-centric dynamics yields more accurate world dynamics modeling.
The Dynamics Tokenizer enables a compact and interpretable representation of world dynamics.
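To make the idea of decoupled dynamics tokens concrete, here is a minimal toy sketch. It is not the authors' implementation: it assumes, purely for illustration, that a dynamics token is the index of the nearest codebook entry to the change between a current and a future feature vector, and that ego and environment each have their own codebook. The names `nearest_code`, `tokenize_dynamics`, `build_sequence`, the codebooks, and the token strings are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def tokenize_dynamics(current: Vector, future: Vector,
                      codebook: Sequence[Vector]) -> int:
    """Quantize the change from current to future into one discrete token.

    Assumption: "dynamics" is modeled here as a simple feature difference.
    """
    delta = [f - c for c, f in zip(current, future)]
    return nearest_code(delta, codebook)


def build_sequence(ego_token: int, env_token: int,
                   action_tokens: Sequence[str]) -> List[str]:
    """Place the dynamics tokens before the action tokens (hypothetical format)."""
    return [f"<ego_{ego_token}>", f"<env_{env_token}>"] + list(action_tokens)


# Hypothetical codebooks: one for ego motion, one for environment change.
EGO_CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ENV_CODEBOOK = [[0.0], [-1.0]]

# Ego and environment dynamics are tokenized separately (decoupled).
ego_tok = tokenize_dynamics([2.0, 3.0], [2.9, 3.1], EGO_CODEBOOK)
env_tok = tokenize_dynamics([5.0], [4.2], ENV_CODEBOOK)
sequence = build_sequence(ego_tok, env_tok, ["<act_0>", "<act_1>"])
print(sequence)
```

The point of the sketch is only the ordering and the decoupling: two small discrete tokens summarize how the scene is expected to change, and they precede the action tokens in the output sequence. How DynVLA actually learns its tokenizer and codebooks is not specified in this summary.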
Abstract
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
