Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Zhiheng Liu; Weiming Ren; Xiaoke Huang; Shoufa Chen; Tianhong Li; Mengzhao Chen; Yatai Ji; Sen He; Jonas Schult; Belinda Zeng; Tao Xiang; Wenhu Chen; Ping Luo; Luke Zettlemoyer; Yuren Cong

arXiv:2604.24763·cs.CV·May 19, 2026

TL;DR

Tuna-2 introduces a unified multimodal model that directly uses pixel embeddings for understanding and generation, eliminating the need for pretrained vision encoders and achieving state-of-the-art results.

Contribution

Tuna-2 presents an encoder-free approach that simplifies the model architecture and performs multimodal understanding and generation directly from raw pixels.

Findings

01. Tuna-2 achieves state-of-the-art performance on multimodal benchmarks.

02. The encoder-free design converges more slowly but yields stronger understanding at scale.

03. Pretrained vision encoders are not necessary for high-quality multimodal modelling.

Abstract

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster…
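
To make the "simple patch embedding layers" mentioned in the abstract concrete, here is a minimal, dependency-free sketch of a generic patch embedding: the image is cut into non-overlapping patches, each patch is flattened, and a single linear layer maps it to a token vector. This is not the authors' implementation. The function names `patchify` and `embed_patches`, the patch size of 4, the embedding width of 8, and the random weights are all assumptions chosen for illustration.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch in row-major order, channels last.
            patches.append([
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ])
    return patches


def embed_patches(patches, weight, bias):
    """Apply one linear layer per patch: token[j] = sum_i patch[i] * weight[j][i] + bias[j]."""
    return [
        [sum(x * w for x, w in zip(patch, row)) + b for row, b in zip(weight, bias)]
        for patch in patches
    ]


# Illustrative sizes only (assumed, not taken from the paper).
PATCH, DIM, CHANNELS = 4, 8, 3
rng = random.Random(0)

# A random 8 x 8 RGB "image" standing in for raw pixel input.
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(8)] for _ in range(8)]

patch_dim = PATCH * PATCH * CHANNELS
weight = [[rng.gauss(0.0, 0.02) for _ in range(patch_dim)] for _ in range(DIM)]
bias = [0.0] * DIM

tokens = embed_patches(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))  # 4 8
```

An 8 x 8 image with 4 x 4 patches yields four tokens, each of width 8. In a deep learning framework the same operation is usually written as a strided convolution whose kernel size and stride both equal the patch size.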

Code & Models

Repositories

facebookresearch/tuna-2 (GitHub)
