AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

Zhi Jing; Jinbin Qiao; Ouyang Lu; Jicong Ao; Shuang Qiu; Yu-Gang Jiang; Chenjia Bai

arXiv:2604.08983·cs.RO·April 13, 2026

TL;DR

AssemLM is a multimodal large language model designed for robotic assembly, integrating 3D perception and reasoning to predict precise 6D poses, supported by a new large-scale dataset for spatial reasoning.

Contribution

The paper introduces AssemLM, a novel spatial multimodal LLM for robotic assembly, and presents AssemBench, a large dataset for 3D spatial reasoning evaluation.

Findings

01. Achieves state-of-the-art 6D pose reasoning accuracy.

02. Supports fine-grained, multi-step robotic assembly in real-world tests.

03. Extends spatial reasoning benchmarks into full 3D geometric inference.

Abstract

Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. While recent vision-language models (VLMs) exhibit preliminary spatial awareness, they largely rely on coarse 2D perception and lack the ability to perform accurate reasoning over 3D geometry, which is crucial for precise assembly operations. To address this limitation, we propose AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses, enabling explicit geometric understanding throughout the assembly process. To effectively bridge raw 3D perception and high-level reasoning, we adopt a specialized point cloud encoder to capture fine-grained geometric and rotational features, which are…
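For readers unfamiliar with the term, a 6D pose is conventionally a 3D rotation plus a 3D translation, applied to a part's point cloud as p' = R p + t. The sketch below illustrates that convention only; it is not AssemLM's code. The function names are hypothetical, and the geodesic rotation error shown is a common pose metric, assumed here for illustration rather than confirmed as the one the paper reports.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation matrix for a rotation about the z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_pose(rotation, translation, points):
    """Apply the rigid transform p' = R p + t to each 3D point."""
    moved = []
    for p in points:
        moved.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return moved

def rotation_error_deg(r_pred, r_true):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_pred^T R_true) equals the elementwise product summed over i, j.
    trace = sum(r_pred[i][j] * r_true[i][j]
                for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and true translations."""
    return math.dist(t_pred, t_true)

# Hypothetical part: three points, rotated 90 degrees about z, then shifted.
part = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
placed = apply_pose(rotation_z(math.pi / 2), (1.0, 0.0, 0.0), part)
```

A pose prediction can then be scored by comparing its rotation and translation against a reference pose with `rotation_error_deg` and `translation_error`.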
