AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly
Zhi Jing, Jinbin Qiao, Ouyang Lu, Jicong Ao, Shuang Qiu, Yu-Gang Jiang, Chenjia Bai

TL;DR
AssemLM is a multimodal large language model for robotic assembly that integrates 3D perception and reasoning to predict precise 6D poses. It is supported by a new large-scale dataset for spatial reasoning.
Contribution
The paper introduces AssemLM, a novel spatial multimodal LLM for robotic assembly, and presents AssemBench, a large dataset for 3D spatial reasoning evaluation.
Findings
Achieves state-of-the-art 6D pose reasoning accuracy.
Supports fine-grained, multi-step robotic assembly in real-world tests.
Extends spatial reasoning benchmarks into full 3D geometric inference.
Abstract
Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. While recent vision-language models (VLMs) exhibit preliminary spatial awareness, they largely rely on coarse 2D perception and lack the ability to perform accurate reasoning over 3D geometry, which is crucial for precise assembly operations. To address this limitation, we propose AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses, enabling explicit geometric understanding throughout the assembly process. To effectively bridge raw 3D perception and high-level reasoning, we adopt a specialized point cloud encoder to capture fine-grained geometric and rotational features, which are…
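To make the notion of a 6D assembly pose concrete, here is a minimal sketch in plain Python. It assumes the usual reading of "6D pose" as a 3D rotation plus a 3D translation applied to a part's point cloud. The function names (`rotation_z`, `apply_pose`, `rotation_error_deg`, `translation_error`) are hypothetical, and the two error measures are common choices in pose estimation generally; the abstract does not say which metrics AssemLM or AssemBench actually use.

```python
import math

def rotation_z(theta):
    """Return a 3x3 rotation matrix for a rotation of theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(rotation, translation, points):
    """Apply a rigid transform to a point cloud: each point p maps to R p + t."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]

def rotation_error_deg(r_pred, r_true):
    """Geodesic angle in degrees between two rotation matrices (assumed metric)."""
    # trace(R_pred^T R_true) equals the elementwise product summed over all entries.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and true translations (assumed metric)."""
    return math.dist(t_pred, t_true)

if __name__ == "__main__":
    # Toy part: three points standing in for a point cloud.
    part = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    true_rotation, true_translation = rotation_z(math.pi / 2), (1.0, 0.0, 0.0)
    pred_rotation, pred_translation = rotation_z(math.pi / 3), (1.0, 0.1, 0.0)

    placed = apply_pose(true_rotation, true_translation, part)
    print("placed part:", placed)
    print("rotation error (deg):", rotation_error_deg(pred_rotation, true_rotation))
    print("translation error:", translation_error(pred_translation, true_translation))
```

The sketch only illustrates the geometry that a predicted pose encodes. It says nothing about how the model produces the pose from manuals, point clouds, and instructions.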
