Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Jinghui Lu; Jiayi Guan; Zhijian Huang; Jinlong Li; Guang Li; Lingdong Kong; Yingyan Li; Han Wang; Shaoqing Xu; Yuechen Luo; Fang Li; Chenxu Dang; Junli Wang; Tao Xu; Jing Wu; Jianhua Wu; Xiaoshuai Hao; Wen Zhang; Tianyi Jiang; Lingfeng Zhang; Lei Zhou; Yingbo Tang; Jie Wang; Yinfeng Gao; Xizhou Bu; Haochen Tian; Yihang Qiu; Feiyang Jia; Lin Liu; Yigu Ge; Hanbing Li; Yuannan Shen; Jianwei Cui; Hongwei Xie; Bing Wang; Haiyang Sun; Jingwei Zhao; Jiahui Huang; Pei Liu; Zeyu Zhu; Yuncheng Jiang; Zibin Guo; Chuhong Gong; Hanchao Leng; Kun Ma; Naiyan Wang; Guang Chen; Kuiyuan Yang; Hangjun Ye; Long Chen

arXiv:2604.18486 · cs.CV · May 12, 2026

TL;DR

OneVL introduces a unified latent reasoning framework for autonomous driving that internalizes causal dynamics, enabling faster inference and surpassing explicit chain-of-thought methods in accuracy.

Contribution

The paper presents a latent CoT approach with a visual world model decoder, improving both speed and accuracy over explicit reasoning methods in autonomous driving tasks.

Findings

01 OneVL outperforms explicit CoT on four benchmarks.

02 OneVL achieves answer-only latency comparable to direct prediction.

03 Latent CoT with world model supervision yields more generalizable representations.

Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that…
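
The dual-decoder supervision described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the weighted-sum form, the squared-error trajectory term, and the names `AuxWeights`, `mse`, and `training_loss` are all assumptions made for illustration. Because the abstract is cut off before it says what the visual world model decoder predicts, the two auxiliary losses are passed in as precomputed scalars rather than modeled.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AuxWeights:
    """Hypothetical loss weights; the abstract does not give any values."""
    language: float = 1.0
    visual: float = 1.0


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(
    traj_pred: Sequence[float],
    traj_gt: Sequence[float],
    language_loss: float,
    visual_loss: float,
    weights: AuxWeights,
) -> float:
    """Trajectory loss plus weighted auxiliary decoder losses (assumed form)."""
    return (
        mse(traj_pred, traj_gt)
        + weights.language * language_loss
        + weights.visual * visual_loss
    )
```

In this sketch, setting both weights to zero reduces the objective to plain trajectory regression, which makes the auxiliary decoders' role as training-time supervision on the latent tokens explicit. Whether OneVL balances its terms this way is not stated in the abstract.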
