Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation

Xiangkai Ma; Lekai Xing; Han Zhang; Wenzhong Li; Sanglu Lu

arXiv:2511.19859 · cs.RO · January 30, 2026

Open Access

TL;DR

This paper introduces VITA, a hybrid-modality pipeline with an implicit visual chain-of-thought that unifies perception and action in robotic systems, achieving state-of-the-art results in complex environments.

Contribution

VITA proposes a shared discrete latent space for vision and action, enabling joint perception and motor control with an implicit visual CoT for improved motion planning.

Findings

01 VITA outperforms existing baselines on the CALVIN, LIBERO, and SimplerEnv benchmarks.

02 VITA achieves an average success rate of 80.5% across six real-world tasks.

03 VITA demonstrates state-of-the-art performance in both simulated and real-world environments.

Abstract

Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to their strong perceptual comprehension. Because text-only CoT struggles to capture scene details in complex spatial environments, a promising recent strategy leverages visual priors to guide robotic action generation. Nevertheless, such strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces an implicit visual…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning