ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Chenyang Gu, Jiaming Liu, Hao Chen, Runzhong Huang, Qingpo Wuwu, Zhuoyang Liu, Xiaoqi Li, Ying Li, Renrui Zhang, Peng Jia, Pheng-Ann Heng, Shanghang Zhang

TL;DR
ManualVLA introduces a unified vision-language-action model with a planning and reasoning framework that improves robotic manipulation in complex, goal-oriented tasks by generating multimodal manuals and explicit control conditions.
Contribution
The paper presents ManualVLA, a novel Mixture-of-Transformers architecture that integrates manual generation and action execution, enabling better planning and manipulation in long-horizon tasks.
Findings
Achieves a 32% higher success rate than the previous SOTA on LEGO assembly.
Effectively generates multimodal manuals for complex tasks.
Utilizes a digital-twin toolkit for automatic manual data generation.
Abstract
Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating high-level planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
