TL;DR
EVA01 introduces a unified Mixture-of-Transformers framework that natively integrates 3D mesh understanding and generation into multimodal large language models, enabling high-fidelity text-to-3D synthesis and advanced geometric editing.
Contribution
The paper presents EVA01, an architecture that treats 3D meshes as a native MLLM modality: a Mixture-of-Transformers (MoT) design handles mesh understanding, generation, and context-aware editing within one multimodal sequence, rather than attaching 3D as an external output.
Findings
Achieves state-of-the-art text-to-3D generation fidelity.
Enables robust, long-context geometric editing with identity preservation.
Provides architectural insights for integrating 2D foundation models with 3D tasks.
Abstract
This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert and a structurally mirrored…
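To make the Mixture-of-Transformers idea more concrete, the following is a minimal sketch of the general MoT routing pattern: every token attends over the full mixed sequence, and each token is then transformed by the weights of the expert that owns its modality. It is not EVA01's implementation. The function names, the toy weights, the identity query/key/value projections, and the label "generation" for the second expert are all assumptions made for illustration, since the abstract is truncated before it names that expert.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_attention(tokens):
    """Global self-attention over the whole mixed-modality sequence.
    Simplification: queries, keys, and values are the tokens themselves."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def mot_layer(tokens, modalities, expert_weights):
    """One MoT-style layer: shared attention, then a per-modality expert
    projection added back as a residual."""
    attended = shared_attention(tokens)
    return [
        [a + h for a, h in zip(att, matvec(expert_weights[mod], att))]
        for att, mod in zip(attended, modalities)
    ]

# Hypothetical toy setup: two experts with hand-picked 2x2 weights.
experts = {
    "understanding": [[0.0, 0.0], [0.0, 0.0]],  # contributes nothing beyond the residual
    "generation": [[1.0, 0.0], [0.0, 1.0]],     # doubles the attended vector
}
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modalities = ["understanding", "understanding", "generation"]
out = mot_layer(tokens, modalities, experts)
```

In this sketch the experts exchange information only through the shared attention step, while their parameters stay separate. That separation is what would allow one expert to keep pre-trained weights while the other is trained for a new modality.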
