EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Zongyuan Yang; Mingjing Yi; Wanli Ma; Chenzhuo Fan; Bocheng Li; Baolin Liu; Yuke Lou; Yingde Song; Yongping Xiong; Zhengdong Guo; Shimu Wang

arXiv:2605.16745·cs.CV·May 19, 2026

TL;DR

EVA01 introduces a unified Mixture-of-Transformers framework that natively integrates 3D mesh understanding and generation into multimodal large language models, enabling high-fidelity text-to-3D synthesis and advanced geometric editing.

Contribution

The paper presents EVA01, an architecture that combines 3D mesh understanding, generation, and context-aware editing within a single MLLM by means of a Mixture-of-Transformers design.

Findings

01. Achieves state-of-the-art text-to-3D generation fidelity.

02. Enables robust, long-context geometric editing with identity preservation.

03. Provides architectural insights for integrating 2D foundation models with 3D tasks.

Abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ($E_{und}$) and a structurally mirrored…
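To make the Mixture-of-Transformers idea in the abstract more concrete, here is a minimal sketch of the general pattern: each token is projected with the weights of its own expert, while attention runs jointly over the whole mixed sequence. This is an illustration under stated assumptions, not EVA01's implementation. The expert key `und` follows $E_{und}$; the key `gen` is a placeholder, since the abstract is cut off before naming the second expert. The function names, the single-head non-causal attention, and the toy dimensions are all hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(dim, rng):
    """One expert = its own Q/K/V/output projections (hypothetical layout)."""
    def rand_matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {name: rand_matrix() for name in ("q", "k", "v", "out")}

def mot_layer(tokens, modalities, experts):
    """tokens: list of vectors; modalities: one expert key per token."""
    # 1) Each token is projected with the weights of its own expert.
    q = [matvec(experts[m]["q"], x) for x, m in zip(tokens, modalities)]
    k = [matvec(experts[m]["k"], x) for x, m in zip(tokens, modalities)]
    v = [matvec(experts[m]["v"], x) for x, m in zip(tokens, modalities)]
    scale = 1.0 / math.sqrt(len(tokens[0]))
    outputs = []
    for i, m in enumerate(modalities):
        # 2) Attention is joint: token i attends over the whole mixed sequence.
        weights = softmax([scale * sum(a * b for a, b in zip(q[i], kj)) for kj in k])
        mixed = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        # 3) The output projection is expert-specific again, plus a residual.
        projected = matvec(experts[m]["out"], mixed)
        outputs.append([xi + pi for xi, pi in zip(tokens[i], projected)])
    return outputs

# Toy usage: two tokens routed to "und", three to the placeholder "gen" expert.
rng = random.Random(0)
dim = 4
experts = {"und": make_expert(dim, rng), "gen": make_expert(dim, rng)}
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
modalities = ["und", "und", "gen", "gen", "gen"]
out = mot_layer(tokens, modalities, experts)
```

In this sketch the experts share nothing except the attention step, so one expert's projections could stay frozen at pre-trained values while the other is trained. Whether EVA01 trains its experts this way is not stated in the visible part of the abstract.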

Code & Models

Repositories

https://www.seeles.ai/research/pages/EVA01
