TL;DR
CG-MLLM is a multi-modal large language model that performs 3D captioning and high-resolution 3D generation in a single framework. Its Mixture-of-Transformer design separates token-level modeling from block-level modeling.
Contribution
The paper presents CG-MLLM, a new architecture that enables high-resolution 3D content generation and captioning within a unified multi-modal large language model framework.
Findings
Outperforms existing MLLMs in high-fidelity 3D object generation
Facilitates long-context interactions between tokens and spatial blocks
Enhances 3D understanding through generation training
Abstract
Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard…
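The split between token-level and block-level content described in the abstract can be pictured with a toy routing sketch. This is only an illustration under loud assumptions: the element kinds, the `route` helper, and the two stand-in branch functions are hypothetical names invented here, and the toy transforms are not the paper's TokenAR or BlockAR Transformers. The sketch shows only the general idea of sending each element of one interleaved sequence to a branch chosen by its kind while keeping the original order.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical element kinds; these names are illustrative, not from the paper.
TOKEN = "token"  # a discrete token id
BLOCK = "block"  # a continuous latent block (here: a short list of floats)

Element = Tuple[str, object]


def token_branch(x: int) -> int:
    """Stand-in for a token-level branch: a toy transform on one token id."""
    return x + 1


def block_branch(x: List[float]) -> List[float]:
    """Stand-in for a block-level branch: a toy transform on a whole block."""
    return [2.0 * v for v in x]


def route(sequence: List[Element],
          branches: Dict[str, Callable]) -> List[Element]:
    """Send each element to the branch for its kind, preserving sequence order."""
    out: List[Element] = []
    for kind, payload in sequence:
        if kind not in branches:
            raise ValueError(f"no branch registered for kind {kind!r}")
        out.append((kind, branches[kind](payload)))
    return out


BRANCHES = {TOKEN: token_branch, BLOCK: block_branch}

# One interleaved sequence mixing token-level and block-level elements.
mixed = [(TOKEN, 5), (BLOCK, [0.5, 1.0]), (TOKEN, 7)]
routed = route(mixed, BRANCHES)
```

In a real model each branch would be a set of Transformer parameters operating over a shared context rather than an independent per-element function; the sketch deliberately leaves that out.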
Taxonomy
Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
