URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model

Zhe Li; Xiang Bai; Jieyu Zhang; Zhuangzhe Wu; Che Xu; Ying Li; Chengkai Hou; Shanghang Zhang

arXiv:2511.00940 · cs.RO · November 4, 2025

Open Access

TL;DR

URDF-Anything is an end-to-end framework that automatically reconstructs articulated objects using a 3D multimodal large language model, improving segmentation, kinematic prediction, and generalization for robotic applications.

Contribution

It introduces a novel autoregressive prediction framework with a specialized token mechanism for joint geometric and kinematic object reconstruction.

Findings

01. 17% improvement in geometric segmentation mIoU

02. 29% reduction in kinematic prediction error

03. 50% improvement in physical executability
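
For readers unfamiliar with the segmentation metric in the first finding, here is a minimal sketch of mean intersection-over-union (mIoU) over per-point part labels. The function name `mean_iou`, the toy labels, and the choice to skip parts absent from both prediction and ground truth are assumptions for illustration; the listing does not say how the paper averages the metric.

```python
from typing import List


def mean_iou(pred: List[int], gt: List[int], num_parts: int) -> float:
    """Average per-part IoU over per-point labels (illustrative helper)."""
    ious = []
    for part in range(num_parts):
        # Points labeled `part` in both prediction and ground truth.
        inter = sum(1 for p, g in zip(pred, gt) if p == part and g == part)
        # Points labeled `part` in either prediction or ground truth.
        union = sum(1 for p, g in zip(pred, gt) if p == part or g == part)
        if union > 0:  # assumption: parts absent from both are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four points, two parts, one point mislabeled.
pred = [0, 0, 1, 1]
gt = [0, 1, 1, 1]
score = mean_iou(pred, gt, num_parts=2)
print(f"mIoU = {score:.3f}")
```

In the toy example, part 0 has IoU 1/2 and part 1 has IoU 2/3, so the mean is 7/12.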

Abstract

Constructing accurate digital twins of articulated objects is essential for robotic simulation training and embodied AI world model building, yet historically requires painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while maintaining consistency with the kinematic parameter predictions. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing…
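
One way to picture a [SEG] token interacting with point cloud features is a similarity lookup: each part gets one token embedding, and every point is assigned to the part whose embedding it matches best. The sketch below is an assumption-laden illustration, not the paper's implementation; the abstract does not specify the interaction, and the dot-product scoring, the function `assign_points_to_parts`, and the toy 2-D features are all hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def assign_points_to_parts(
    point_features: List[Sequence[float]],
    seg_embeddings: List[Sequence[float]],
) -> List[int]:
    """Label each point with the index of its best-matching [SEG] embedding.

    Assumption: similarity is a dot product; the real model may use a
    learned decoder or attention instead.
    """
    labels = []
    for feat in point_features:
        scores = [dot(feat, seg) for seg in seg_embeddings]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy data: four points with 2-D features, two part-level [SEG] embeddings.
point_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
seg_embeddings = [[1.0, 0.0], [0.0, 1.0]]
labels = assign_points_to_parts(point_features, seg_embeddings)
print(labels)
```

Because each [SEG] embedding would be emitted in the same autoregressive sequence as a part's kinematic parameters, a design like this keeps each mask tied to one specific part, which is one plausible reading of the consistency claim in the abstract.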

Taxonomy

Topics: Robot Manipulation and Learning · Human Motion and Animation · Multimodal Machine Learning Applications