ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation

Chenyang Gu; Jiaming Liu; Hao Chen; Runzhong Huang; Qingpo Wuwu; Zhuoyang Liu; Xiaoqi Li; Ying Li; Renrui Zhang; Peng Jia; Pheng-Ann Heng; Shanghang Zhang

arXiv:2512.02013 · cs.RO · December 2, 2025

TL;DR

ManualVLA introduces a unified vision-language-action model with a planning and reasoning framework that improves robotic manipulation in complex, goal-oriented tasks by generating multimodal manuals and explicit control conditions.

Contribution

The paper presents ManualVLA, a unified VLA framework built on a Mixture-of-Transformers (MoT) architecture that integrates multimodal manual generation with action execution, enabling better planning and manipulation in long-horizon tasks.

Findings

01. Achieves a 32% higher success rate than the previous SOTA on LEGO assembly.

02. Generates multimodal manuals for complex tasks.

03. Uses a digital-twin toolkit for automatic manual data generation.

Abstract

Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating high-level planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI