URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model
Zhe Li, Xiang Bai, Jieyu Zhang, Zhuangzhe Wu, Che Xu, Ying Li, Chengkai Hou, Shanghang Zhang

TL;DR
URDF-Anything is an end-to-end framework that automatically reconstructs articulated objects using a 3D multimodal large language model, improving segmentation, kinematic prediction, and generalization for robotic applications.
Contribution
The paper introduces a novel autoregressive prediction framework with a specialized token mechanism for joint geometric and kinematic reconstruction of articulated objects.
Findings
17% improvement in geometric segmentation mIoU
29% reduction in kinematic prediction error
50% improvement in physical executability
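For readers unfamiliar with the segmentation metric named in the findings, here is a minimal sketch of mean intersection-over-union (mIoU) over per-point part labels. The function name `mean_iou` and the choice to skip parts absent from both prediction and ground truth are assumptions made for illustration. The summary does not state how the paper averages IoU, or whether the reported 17% is a relative or absolute gain.

```python
def mean_iou(pred_labels, true_labels, num_parts):
    """Mean IoU over part ids 0..num_parts-1 for two per-point label lists.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(pred_labels) != len(true_labels):
        raise ValueError("prediction and ground truth must label the same points")

    ious = []
    for part in range(num_parts):
        pairs = list(zip(pred_labels, true_labels))
        # Points assigned to this part by both prediction and ground truth.
        intersection = sum(1 for p, t in pairs if p == part and t == part)
        # Points assigned to this part by either one.
        union = sum(1 for p, t in pairs if p == part or t == part)
        if union == 0:
            continue  # assumption: a part absent from both is skipped
        ious.append(intersection / union)

    return sum(ious) / len(ious) if ious else 0.0


# Six points, three parts; one point of part 0 is mislabeled as part 1.
example_score = mean_iou([0, 0, 1, 1, 1, 2], [0, 0, 0, 1, 1, 2], num_parts=3)
```

In this toy case parts 0 and 1 each score 2/3 and part 2 scores 1, so the mean is 7/9.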
Abstract
Constructing accurate digital twins of articulated objects is essential for robotic simulation training and embodied AI world model building, yet historically requires painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while maintaining consistency with the kinematic parameter predictions. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing…
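The abstract does not spell out how the specialized token interacts with point cloud features. One common pattern in segmentation-oriented multimodal models is to score each point by the dot product between a token embedding and that point's feature vector, then threshold the result into a part mask. The sketch below shows that pattern only. The function `part_mask_from_token`, the sigmoid scoring, and the 0.5 threshold are all assumptions for illustration, not the paper's confirmed design.

```python
import math


def part_mask_from_token(token_embedding, point_features, threshold=0.5):
    """Return one boolean per point: does the point belong to the token's part?

    Hypothetical sketch: token-to-point similarity via dot product + sigmoid.
    """
    mask = []
    for feature in point_features:
        # Similarity between the token embedding and this point's feature.
        score = sum(t * f for t, f in zip(token_embedding, feature))
        # Squash to (0, 1) so the score reads as a membership probability.
        probability = 1.0 / (1.0 + math.exp(-score))
        mask.append(probability > threshold)
    return mask


# Toy 2-D features: only the first point aligns with the token direction.
toy_mask = part_mask_from_token(
    token_embedding=[1.0, -1.0],
    point_features=[[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
)
```

Under this assumed design, the model would emit one such token per predicted part, so each mask is tied to the same output sequence that carries that part's kinematic parameters.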
Taxonomy
Topics: Robot Manipulation and Learning · Human Motion and Animation · Multimodal Machine Learning Applications
