Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo

TL;DR
Part-X-MLLM is a novel 3D multimodal large language model that unifies diverse 3D tasks through structured program generation from RGB point clouds and natural language prompts, enabling versatile geometry-aware applications.
Contribution
It introduces a structured, program-based approach to 3D multimodal understanding, combining symbolic planning with geometric synthesis for the first time.
Findings
Achieves state-of-the-art results in grounded Q&A and compositional generation.
Enables localized editing through a unified language interface.
Demonstrates high-quality, structured planning for 3D tasks.
Abstract
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing…
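As a rough illustration of what "a single token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands" could look like to a downstream consumer, the sketch below parses a made-up program string into typed records. The tag names (`<part>`, `<bbox>`, `<desc>`, `<edit>`), the integer coordinate format, and the `parse_program` helper are all assumptions invented for this example; the paper's actual grammar and special tokens are not specified in this summary.

```python
import re
from dataclasses import dataclass

# Hypothetical token grammar, invented for illustration only. The real
# special tokens and coordinate encoding used by Part-X-MLLM may differ.
PART_RE = re.compile(
    r"<part>\s*<bbox>(?P<bbox>[^<]*)</bbox>\s*<desc>(?P<desc>[^<]*)</desc>\s*</part>"
)
EDIT_RE = re.compile(r'<edit\s+op="(?P<op>\w+)"\s+target="(?P<target>\d+)"\s*/>')


@dataclass
class Part:
    bbox: tuple  # (x0, y0, z0, x1, y1, z1), assumed quantized integer coordinates
    desc: str    # free-text semantic description of the part


@dataclass
class Edit:
    op: str      # assumed edit verb, e.g. "remove"
    target: int  # index into the list of parsed parts


def parse_program(text):
    """Split a generated program string into part records and edit commands."""
    parts = []
    for m in PART_RE.finditer(text):
        coords = tuple(int(v) for v in m.group("bbox").split())
        if len(coords) != 6:
            raise ValueError(f"expected 6 box coordinates, got {len(coords)}")
        parts.append(Part(bbox=coords, desc=m.group("desc").strip()))

    edits = []
    for m in EDIT_RE.finditer(text):
        edit = Edit(op=m.group("op"), target=int(m.group("target")))
        if not 0 <= edit.target < len(parts):
            raise ValueError(f"edit targets unknown part index {edit.target}")
        edits.append(edit)
    return parts, edits


# Example program in the hypothetical grammar: two parts and one edit command.
program = (
    "<part><bbox>0 0 0 10 10 4</bbox><desc>seat</desc></part>"
    "<part><bbox>0 0 4 2 2 12</bbox><desc>chair leg</desc></part>"
    '<edit op="remove" target="1"/>'
)
parts, edits = parse_program(program)
```

The point of the sketch is the decoupling the abstract describes: once the language model's output is parsed into plain records like these, any geometry engine that accepts boxes, descriptions, and edit commands could consume them without knowing how they were generated.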
Peer Reviews
Decision · ICLR 2026 Poster
This paper presents Part-X-MLLM, a 3D large language model for diverse 3D tasks by formulating them as programs in an executable grammar. Overall, the work is decent, includes a large curated dataset, and is generalizable across 11 tasks.
+ The two-stage training process is interesting to help the model learn the underlying 3D structure and associate the pretrained language knowledge with it.
+ Semantic Granularity Control is an interesting part of the work. The part-aware synthesis is useful
- Small typo in line 192 'boxe'
- The writing is a bit hard to follow in places. For example, in line 35, I am not sure why Part-X-MLLM is 'native'. Similarly, the mention of 'structural opaqueness' in line 53 is not clear.
- The distinction with past works is not clear enough. I would love to see a table comparing past works and Part-X-MLLM.
- For the qualitative analysis, it would have been great to see a small-scale study with real participants and evaluate the performance of Part-X-MLLM q
1. Addressing part-level 3D multimodal modeling is timely and necessary.
2. The proposed dual-encoder design effectively encodes complementary attributes of 3D objects.
3. The use of task-specific prompts and special tokens enables diverse part-centric tasks within a unified framework.
1. Evaluation metrics rely mainly on traditional natural-language metrics; consider including LLM-based scoring (e.g., GPT-judge) for more robust assessment.
2. Baselines: comparison is limited; please include strong 2025-era SOTA 3D multimodal models on QA and grounding tasks (e.g., Mini-GPT-3D).
3. Benchmark: experiments are primarily on the authors' dataset; please evaluate on established public 3D benchmarks, such as the Point-LLM test suite, and include metrics for point resolution sensitivi
- The research on part-based 3D generation is highly practical, and the authors have designed a unified framework that integrates 3D generation, understanding, and editing, which is very valuable.
- The paper not only proposes a large model, Part-X-MLLM, but also introduces a new benchmark and includes extensive experimental comparisons in both 3D generation and understanding, demonstrating substantial effort.
- The writing is clear and easy to follow, and the figures are professionally designed
See the "Questions" section.
Taxonomy
Topics: 3D Shape Modeling and Analysis · Human Motion and Animation · Multimodal Machine Learning Applications
