ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Chi-Pin Huang; Yueh-Hua Wu; Min-Hung Chen; Yu-Chiang Frank Wang; Fu-En Yang

arXiv:2507.16815 · cs.CV · September 19, 2025

TL;DR

ThinkAct introduces a dual-system framework that couples high-level visual reasoning with low-level action execution through reinforced visual latent planning, improving few-shot adaptation, long-horizon planning, and self-correction in embodied AI tasks.

Contribution

It presents a novel visual latent planning approach that enhances reasoning and planning capabilities in vision-language-action tasks through reinforcement learning.

Findings

01 Enables few-shot adaptation in complex tasks

02 Supports long-horizon planning and self-correction

03 Improves robustness in embodied AI environments

Abstract

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments.…
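
The abstract names two ingredients of the action-aligned visual reward, goal completion and trajectory consistency, but this excerpt gives no formulas. The following is only a minimal sketch of how such a reward could be combined, under assumptions that are not taken from the paper: trajectories are lists of 2D points normalized to [0, 1], both trajectories have the same length, distances are Euclidean, and the names `goal_reward`, `trajectory_reward`, `visual_reward`, `w_goal`, and `w_traj` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # assumed: normalized 2D image coordinates


def _dist(a: Point, b: Point) -> float:
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def goal_reward(pred: List[Point], ref: List[Point]) -> float:
    """Hypothetical goal-completion term: 1.0 when the predicted endpoint
    coincides with the reference endpoint, decreasing linearly with distance
    and clipped at 0."""
    return max(0.0, 1.0 - _dist(pred[-1], ref[-1]))


def trajectory_reward(pred: List[Point], ref: List[Point]) -> float:
    """Hypothetical trajectory-consistency term: 1.0 minus the mean pointwise
    distance between equal-length trajectories, clipped at 0."""
    if len(pred) != len(ref):
        raise ValueError("this sketch assumes equal-length trajectories")
    mean_d = sum(_dist(p, r) for p, r in zip(pred, ref)) / len(ref)
    return max(0.0, 1.0 - mean_d)


def visual_reward(pred: List[Point], ref: List[Point],
                  w_goal: float = 0.5, w_traj: float = 0.5) -> float:
    """Weighted sum of the two terms; the weights are illustrative only."""
    return w_goal * goal_reward(pred, ref) + w_traj * trajectory_reward(pred, ref)


if __name__ == "__main__":
    reference = [(0.0, 0.0), (1.0, 0.0)]
    predicted = [(0.0, 0.0), (0.5, 0.0)]  # stops halfway to the goal
    print(visual_reward(predicted, reference))
```

In a reinforcement-learning loop, a scalar like this would score each reasoning plan sampled from the multimodal LLM, so that plans whose predicted trajectories reach the goal and stay close to the reference receive a higher reward. The exact distance measure and weighting used by ThinkAct may differ.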

Videos

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques