VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Chaoyang Wang; Wenrui Bao; Sicheng Gao; Bingxin Xu; Yu Tian; Yogesh S. Rawat; Yunhao Ge; Yuzhang Shang

arXiv:2603.14523·cs.CV·March 17, 2026

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S. Rawat, Yunhao Ge, Yuzhang Shang

PDF

Open Access

TL;DR

VLA-Thinker introduces a dynamic reasoning framework for vision-language-action models, enabling active environment revisiting and improved long-horizon task performance through a two-stage training pipeline.

Contribution

It presents a novel thinking-with-image reasoning approach with a specialized training pipeline, enhancing embodied intelligence in VLA models.

Findings

01

Achieves 97.5% success on LIBERO benchmark.

02

Significantly improves manipulation performance.

03

Demonstrates strong gains on long-horizon robotic tasks.

Abstract

Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Social Robot Interaction and HRI · Reinforcement Learning in Robotics