Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models

Chenrui Tie; Shengxiang Sun; Jinxuan Zhu; Yiwei Liu; Jingxiang Guo; Yue Hu; Haonan Chen; Junting Chen; Ruihai Wu; Lin Shao

arXiv:2502.10090·cs.RO·October 21, 2025

TL;DR

Manual2Skill enables robots to interpret high-level manual instructions and perform complex furniture assembly tasks by integrating vision-language models, hierarchical graph construction, pose estimation, and motion planning, demonstrating practical real-world applications.

Contribution

The paper introduces Manual2Skill, a novel framework that combines vision-language models, hierarchical assembly graphs, pose estimation, and motion planning for robotic furniture assembly from manuals.

Findings

1. Successfully assembled IKEA furniture items in real-world tests.
2. Enhanced robot understanding and execution of complex, long-horizon tasks.
3. Demonstrated practical application of vision-language models in robotics.

Abstract

Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates…
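
One way to picture the hierarchical assembly graph mentioned in the abstract is as a tree whose leaves are atomic parts and whose internal nodes are subassemblies. The sketch below is illustrative only: the class `AssemblyNode`, the function `assembly_steps`, and the stool part names are hypothetical and are not taken from the paper or its repository. It assumes each internal node corresponds to one manual step, and it derives a valid build order with a post-order traversal.

```python
from dataclasses import dataclass, field


@dataclass
class AssemblyNode:
    """Node in a hypothetical hierarchical assembly graph.

    Leaves are atomic parts; internal nodes are subassemblies that are
    built from their children in a single (assumed) manual step.
    """
    name: str
    children: list["AssemblyNode"] = field(default_factory=list)

    @property
    def is_part(self) -> bool:
        # A node with no children is an atomic part.
        return not self.children


def assembly_steps(root):
    """Return (subassembly, [child names]) pairs in a valid build order.

    Post-order traversal ensures every child subassembly is built
    before the step that consumes it.
    """
    steps = []

    def visit(node):
        for child in node.children:
            visit(child)
        if not node.is_part:
            steps.append((node.name, [c.name for c in node.children]))

    visit(root)
    return steps


# Made-up example: a stool with two legs, a crossbar, and a seat.
legs = [AssemblyNode(f"leg_{i}") for i in range(1, 3)]
frame = AssemblyNode("frame", legs + [AssemblyNode("crossbar")])
stool = AssemblyNode("stool", [frame, AssemblyNode("seat")])

for target, inputs in assembly_steps(stool):
    print(f"{target} <- {inputs}")
```

In a pipeline like the one the abstract describes, each step produced this way would be the point at which a pose estimator predicts how the listed components fit together; that stage is not modeled here.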

Code & Models

Repositories

owensun2004/Manual2Skill (PyTorch, official)

Taxonomy

Topics: BIM and Construction Integration · Augmented Reality Applications · Robotics and Automated Systems