FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models
Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, and Haoang Li

TL;DR
FlowVLA introduces a motion reasoning paradigm for vision-language-action models, explicitly modeling dynamics via optical flow to improve visual prediction coherence and policy efficiency in robotics tasks.
Contribution
The paper proposes the Visual Chain of Thought (Visual CoT) paradigm and FlowVLA, an autoregressive Transformer that explicitly reasons about motion dynamics before predicting the next frame, improving physical plausibility and policy performance.
Findings
Produces more coherent, physically plausible visual forecasts.
Achieves state-of-the-art policy performance on robotics benchmarks.
Improves sample efficiency in policy learning.
Abstract
Many Vision-Language-Action (VLA) models are built upon an internal world model trained via next-frame prediction (current frame → next frame). However, this paradigm attempts to predict the future frame's appearance directly, without explicitly reasoning about the underlying dynamics. This lack of an explicit motion reasoning step often leads to physically implausible visual forecasts and inefficient policy learning. To address this limitation, we introduce the Visual Chain of Thought (Visual CoT), a paradigm that compels the model to first reason about motion dynamics before generating the future frame. We instantiate this paradigm by proposing FlowVLA, an autoregressive Transformer that explicitly materializes this reasoning process as a current frame → optical flow → next frame sequence, where the optical flow is an intermediate prediction that…
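The ordering described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the marker tokens `<frame>` and `<flow>`, the function name `interleave_visual_cot`, and the use of plain integer lists as stand-ins for tokenized frames and flow maps are all assumptions made for illustration. It only shows how a training sequence might place a flow segment between consecutive frames, so that an autoregressive model has to emit motion before appearance.

```python
from typing import List, Sequence

# Hypothetical segment markers; the source does not describe a vocabulary.
FRAME_MARK = "<frame>"
FLOW_MARK = "<flow>"


def interleave_visual_cot(
    frames: Sequence[List[int]],
    flows: Sequence[List[int]],
) -> List[object]:
    """Build the sequence frame_0, flow_0, frame_1, flow_1, ..., frame_T.

    frames[t] stands in for the tokens of frame t. flows[t] stands in for
    the tokens of the optical flow between frame t and frame t+1 (an
    assumption about how flow would be encoded).
    """
    if len(flows) != len(frames) - 1:
        raise ValueError("need exactly one flow per consecutive frame pair")

    sequence: List[object] = []
    for t, frame_tokens in enumerate(frames):
        sequence.append(FRAME_MARK)
        sequence.extend(frame_tokens)
        # Every frame except the last is followed by the flow that leads
        # to the next frame, so motion is predicted before appearance.
        if t < len(flows):
            sequence.append(FLOW_MARK)
            sequence.extend(flows[t])
    return sequence


def next_frame_only(frames: Sequence[List[int]]) -> List[object]:
    """Baseline ordering for comparison: frames only, no flow segments."""
    sequence: List[object] = []
    for frame_tokens in frames:
        sequence.append(FRAME_MARK)
        sequence.extend(frame_tokens)
    return sequence
```

Calling `interleave_visual_cot([[1, 2], [3, 4]], [[9]])` places the flow tokens `[9]` between the two frames, whereas `next_frame_only` simply concatenates the frames. In a real system the integer lists would come from a visual tokenizer, which the text above does not specify.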
Taxonomy
Topics: Semantic Web and Ontologies · Multimodal Machine Learning Applications · Natural Language Processing Techniques
