Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

Chunshi Wang; Junliang Ye; Yunhan Yang; Yang Li; Zizhuo Lin; Jun Zhu; Zhuo Chen; Yawei Luo; Chunchao Guo

arXiv:2511.13647·cs.CV·November 18, 2025

Open Access · 3 Reviews

TL;DR

Part-X-MLLM is a novel 3D multimodal large language model that unifies diverse 3D tasks through structured program generation from RGB point clouds and natural language prompts, enabling versatile geometry-aware applications.

Contribution

It introduces a structured, program-based approach to 3D multimodal understanding in which the language model performs the symbolic planning and downstream geometry-aware modules perform the geometric synthesis.

Findings

01

Achieves state-of-the-art results in grounded Q&A and compositional generation.

02

Enables localized editing through a unified language interface.

03

Demonstrates high-quality, structured planning for 3D tasks.

Abstract

We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing…
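To make the idea of a structured program as an interface more concrete, here is a minimal sketch of how a downstream module might consume such an output. The abstract does not spell out the grammar, so the tag names (`<part>`, `<bbox>`, `<desc>`, `<edit>`), the six-integer box format, and the functions `parse_program` and `apply_edits` are all hypothetical stand-ins and not the paper's actual tokens or API.

```python
import re
from dataclasses import dataclass


@dataclass
class Part:
    # Assumed layout: (xmin, ymin, zmin, xmax, ymax, zmax) as quantized bin indices.
    bbox: tuple
    description: str


# Hypothetical grammar for illustration only; the paper's real tokens may differ.
PART_RE = re.compile(
    r"<part>\s*<bbox>([^<]*)</bbox>\s*<desc>([^<]*)</desc>\s*</part>"
)
EDIT_RE = re.compile(r"<edit>\s*(\w+)\s+(\d+)\s*</edit>")


def parse_program(text):
    """Split one generated sequence into part records and edit commands."""
    parts = []
    for match in PART_RE.finditer(text):
        coords = tuple(int(v) for v in match.group(1).split())
        if len(coords) != 6:
            raise ValueError(f"expected 6 box coordinates, got {len(coords)}")
        parts.append(Part(coords, match.group(2).strip()))
    # Each edit is an (operation, part index) pair.
    edits = [(op, int(idx)) for op, idx in EDIT_RE.findall(text)]
    return parts, edits


def apply_edits(parts, edits):
    """Toy stand-in for a geometry engine: it only understands 'delete'."""
    removed = {idx for op, idx in edits if op == "delete"}
    return [p for i, p in enumerate(parts) if i not in removed]


if __name__ == "__main__":
    program = (
        "<part><bbox>0 0 0 10 10 4</bbox><desc>seat</desc></part>"
        "<part><bbox>0 0 4 10 2 20</bbox><desc>backrest</desc></part>"
        "<edit>delete 1</edit>"
    )
    parts, edits = parse_program(program)
    for part in apply_edits(parts, edits):
        print(part.description, part.bbox)
```

The point of the sketch is the separation of concerns: the parser only reads symbols, so any engine that accepts the parsed records could in principle replace `apply_edits` without changing the model's output format.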

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

This paper presents Part-X-MLLM, a 3D large language model for diverse 3D tasks by formulating them as programs in an executable grammar. Overall, the work is decent, includes a large curated dataset, and is generalizable across 11 tasks.
+ The two-stage training process is interesting to help the model learn the underlying 3D structure and associate the pretrained language knowledge with it.
+ Semantic Granularity Control is an interesting part of the work. The part-aware synthesis is useful

Weaknesses

- Small typo in line 192 'boxe'
- The writing is a bit hard to follow in places. For example, in line 35, I am not sure why Part-X-MLLM is 'native'. Similarly, the mention of 'structural opaqueness' in line 53 is not clear.
- The distinction with past works is not clear enough. I would love to see a table comparing past works and Part-X-MLLM.
- For the qualitative analysis, it would have been great to see a small-scale study with real participants and evaluate the performance of Part-X-MLLM q

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. Addressing part-level 3D multimodal modeling is timely and necessary.
2. The proposed dual-encoder design effectively encodes complementary attributes of 3D objects.
3. The use of task-specific prompts and special tokens enables diverse part-centric tasks within a unified framework.

Weaknesses

1. Evaluation metrics rely mainly on traditional natural-language metrics; consider including LLM-based scoring (e.g., GPT-judge) for more robust assessment.
2. Baselines: comparison is limited; please include strong 2025-era SOTA 3D multimodal models on QA and grounding tasks (e.g., Mini-GPT-3D).
3. Benchmark: experiments are primarily on the authors' dataset; please evaluate on established public 3D benchmarks, such as the Point-LLM test suite, and include metrics for point resolution sensitivi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The research on part-based 3D generation is highly practical, and the authors have designed a unified framework that integrates 3D generation, understanding, and editing, which is very valuable.
- The paper not only proposes a large model, Part-X-MLLM, but also introduces a new benchmark and includes extensive experimental comparisons in both 3D generation and understanding, demonstrating substantial effort.
- The writing is clear and easy to follow, and the figures are professionally designed

Weaknesses

See the "Questions" section.

Taxonomy

Topics: 3D Shape Modeling and Analysis · Human Motion and Animation · Multimodal Machine Learning Applications