CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

Junming Huang; Chi Wang; Letian Li; Guangkai Xu; Donglin Huang; Hao Chen; Qiang Dai; Weiwei Xu

arXiv:2601.21798 · cs.CV · May 18, 2026

TL;DR

CG-MLLM introduces a multi-modal large language model capable of high-resolution 3D captioning and generation, integrating a novel architecture to improve fidelity and understanding of 3D content.

Contribution

The paper presents CG-MLLM, a new architecture that enables high-resolution 3D content generation and captioning within a unified multi-modal large language model framework.

Findings

01 Outperforms existing MLLMs in high-fidelity 3D object generation

02 Facilitates long-context interactions between tokens and spatial blocks

03 Enhances 3D understanding through generation training

Abstract

Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs: the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard…
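
The decoupling described in the abstract can be pictured with a toy routing sketch. This is not the authors' implementation: the names `token_expert`, `block_expert`, `shared_context`, and `mixture_step` are hypothetical, the arithmetic is a placeholder for real Transformer layers, and the shared causal context is an assumption about how token and block positions might interact in one sequence. It shows only the general idea of one sequence whose positions are processed by two separate parameter sets depending on their kind.

```python
from typing import Callable, Dict, List

Vector = List[float]


def token_expert(x: Vector) -> Vector:
    # Hypothetical stand-in for token-level (TokenAR-like) parameters.
    return [2.0 * v for v in x]


def block_expert(x: Vector) -> Vector:
    # Hypothetical stand-in for block-level (BlockAR-like) parameters.
    return [v + 1.0 for v in x]


def shared_context(seq: List[Vector]) -> List[Vector]:
    # Assumed shared step: a causal running mean over all earlier
    # positions, regardless of kind, so tokens and blocks can interact.
    if not seq:
        return []
    running = [0.0] * len(seq[0])
    out: List[Vector] = []
    for i, x in enumerate(seq):
        running = [r + v for r, v in zip(running, x)]
        out.append([r / (i + 1) for r in running])
    return out


def mixture_step(seq: List[Vector], kinds: List[str]) -> List[Vector]:
    # Route each position to the expert that matches its kind.
    experts: Dict[str, Callable[[Vector], Vector]] = {
        "token": token_expert,
        "block": block_expert,
    }
    mixed = shared_context(seq)
    return [experts[kind](h) for h, kind in zip(mixed, kinds)]


if __name__ == "__main__":
    sequence = [[1.0, 1.0], [3.0, 5.0]]
    kinds = ["token", "block"]
    print(mixture_step(sequence, kinds))
```

In this sketch the second position is tagged as a block, so it sees the averaged context of both positions but is transformed by the block-level function only.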

Code & Models

Repositories

dreaming-huang/CG-MLLM (GitHub)

Taxonomy

Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis