TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Zhiheng Liu; Weiming Ren; Haozhe Liu; Zijian Zhou; Shoufa Chen; Haonan Qiu; Xiaoke Huang; Zhaochong An; Fanny Yang; Aditya Patel; Viktar Atliha; Tony Ng; Xiao Han; Chuyan Zhu; Chenyang Zhang; Ding Liu; Juan-Manuel Perez-Rua; Sen He; Jürgen Schmidhuber; Wenhu Chen; Ping Luo; Wei Liu; Tao Xiang; Jonas Schult; Yuren Cong

arXiv:2512.02014 · cs.CV · December 2, 2025

Open Access

TL;DR

TUNA introduces a unified visual representation for multimodal models, enabling end-to-end understanding and generation of images and videos, outperforming previous decoupled approaches across multiple benchmarks.

Contribution

TUNA is a native unified multimodal model (UMM) that builds a single visual representation space by cascading a VAE encoder with a representation encoder, improving performance and scalability over prior models that use decoupled representations.

Findings

01. Achieves state-of-the-art results in multimodal understanding and generation.

02. Unified representation improves task performance and scalability.

03. Joint training benefits both understanding and generation tasks.

Abstract

Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and…
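The cascade described in the abstract (a VAE encoder followed by a representation encoder, producing one continuous token space shared by understanding and generation) can be illustrated with a toy sketch. This is not TUNA's implementation: the function names, the patch size, the dimensions, and the use of random linear maps in place of pretrained networks are all assumptions made purely to show the data flow.

```python
import math
import random


def patchify(image, patch):
    """Split a 2D grayscale image (list of rows) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tokens.append(
                [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return tokens


def linear(tokens, weight):
    """Apply a weight matrix of shape (in_dim, out_dim) to every token."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


def vae_encode(image, w_vae, patch):
    """Hypothetical stand-in for a VAE encoder: patches -> continuous latents."""
    return linear(patchify(image, patch), w_vae)


def representation_encode(latents, w_rep):
    """Hypothetical stand-in for a representation encoder run on the VAE latents."""
    return [[math.tanh(v) for v in tok] for tok in linear(latents, w_rep)]


def unified_visual_tokens(image, w_vae, w_rep, patch):
    """Cascade: VAE encoder first, representation encoder second."""
    return representation_encode(vae_encode(image, w_vae, patch), w_rep)


def random_matrix(rows, cols, rng):
    """Small random weights; a real system would load pretrained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
PATCH, LATENT_DIM, REP_DIM = 4, 8, 12  # assumed toy sizes, not from the paper

image = [[rng.random() for _ in range(16)] for _ in range(16)]
w_vae = random_matrix(PATCH * PATCH, LATENT_DIM, rng)
w_rep = random_matrix(LATENT_DIM, REP_DIM, rng)

# The same token sequence would feed both the understanding and generation paths.
tokens = unified_visual_tokens(image, w_vae, w_rep, PATCH)
print(len(tokens), len(tokens[0]))
```

The point of the sketch is only that a single function produces the visual tokens, so there is no second encoder whose output format could disagree with the first; a decoupled design would instead call two separate encoders on the raw image.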

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning