FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models

Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, and Haoang Li

arXiv:2508.18269 · cs.RO · October 8, 2025

TL;DR

FlowVLA introduces a motion reasoning paradigm for vision-language-action models, explicitly modeling dynamics via optical flow to improve visual prediction coherence and policy efficiency in robotics tasks.

Contribution

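The paper proposes Visual CoT, a paradigm in which the model reasons about motion dynamics before predicting the next frame, and FlowVLA, an autoregressive Transformer that instantiates this paradigm by predicting an intermediate optical flow ahead of the future frame. The authors report that this improves the physical plausibility of forecasts and policy performance.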

Findings

1. Produces more coherent, physically plausible visual forecasts.
2. Achieves state-of-the-art policy performance on robotics benchmarks.
3. Improves sample efficiency in policy learning.

Abstract

Many Vision-Language-Action (VLA) models are built upon an internal world model trained via next-frame prediction, "$v_{t} \to v_{t+1}$". However, this paradigm attempts to predict the future frame's appearance directly, without explicitly reasoning about the underlying dynamics. This lack of an explicit motion reasoning step often leads to physically implausible visual forecasts and inefficient policy learning. To address this limitation, we introduce the Visual Chain of Thought (Visual CoT), a paradigm that compels the model to first reason about motion dynamics before generating the future frame. We instantiate this paradigm by proposing FlowVLA, an autoregressive Transformer that explicitly materializes this reasoning process as "$v_{t} \to f_{t} \to v_{t+1}$", where $f_{t}$ is an intermediate optical flow prediction that…
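The ordering described in the abstract can be illustrated with a minimal sketch. It only shows how a flow-before-frame sequence differs from a plain next-frame sequence; it is not the authors' implementation. The function names and the string placeholders standing in for frame and flow tokens are hypothetical, and this excerpt does not say how FlowVLA actually tokenizes frames or optical flow.

```python
from typing import List, Sequence


def next_frame_only(frames: Sequence[str]) -> List[str]:
    """Baseline ordering: v_0, v_1, ..., v_T (each frame predicts the next)."""
    return list(frames)


def interleave_visual_cot(frames: Sequence[str], flows: Sequence[str]) -> List[str]:
    """Flow-before-frame ordering: v_0, f_0, v_1, f_1, ..., v_T.

    Assumes (hypothetically) one flow item f_t per consecutive frame
    pair (v_t, v_{t+1}), so len(flows) must equal len(frames) - 1.
    """
    if len(flows) != len(frames) - 1:
        raise ValueError("expected exactly one flow per consecutive frame pair")
    sequence: List[str] = []
    # Pair each frame v_t with the flow f_t that leads to v_{t+1}.
    for frame, flow in zip(frames, flows):
        sequence.append(frame)
        sequence.append(flow)
    # The final frame has no outgoing flow.
    sequence.append(frames[-1])
    return sequence


if __name__ == "__main__":
    frames = ["v0", "v1", "v2"]  # placeholder frame tokens
    flows = ["f0", "f1"]         # placeholder optical-flow tokens
    print(next_frame_only(frames))
    print(interleave_visual_cot(frames, flows))
```

In an autoregressive model trained on the interleaved sequence, every future frame is generated only after its preceding flow item, which is the sense in which the motion estimate acts as an intermediate reasoning step.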

Taxonomy

Topics: Semantic Web and Ontologies · Multimodal Machine Learning Applications · Natural Language Processing Techniques