HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization

Xuerui Qiu; Yutao Cui; Guozhen Zhang; Junzhe Li; JiaKui Hu; Xiao Zhang; Yang Li; Songtao Liu; Miles Yang; Yu Shi; Zhao Zhong; Liefeng Bo

arXiv:2603.15228·cs.CV·March 18, 2026

Open Access

TL;DR

HYDRA is a unified multimodal model that handles both visual understanding and generation through a representation-harmonized tokenizer, HYDRA-TOK, and reports state-of-the-art results on reconstruction, generation, and understanding benchmarks.

Contribution

The paper presents HYDRA, a framework that unifies perception and generation in a single model. Its tokenizer is a pure ViT that progresses from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding, with a Generation-Semantic Bottleneck (GSB) mediating the transition.

Findings

01. Sets a new state of the art on visual reconstruction and generation benchmarks.

02. Outperforms previous models by an average of 10 points on understanding benchmarks.

03. Achieves top-tier results on the GenEval, DPG-Bench, and WISE benchmarks.

Abstract

Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking a representation encoder atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we introduce HYDRA-TOK, a representation-harmonized pure ViT built on the insight that visual modeling should evolve from generation to understanding. HYDRA-TOK reformulates the standard backbone into a progressive learner that transitions from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding. Crucially, this transition is mediated by a Generation-Semantic Bottleneck (GSB), which compresses features into a…
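
The staged data flow named in the abstract (Gen-ViT, then GSB, then Sem-ViT) can be pictured with a toy sketch. This is not the authors' implementation: each stage is replaced by a single random linear map, and the class name, dimensions, and the choice of a plain down-projection for the bottleneck are all assumptions made for illustration. The abstract is truncated before it says what the GSB compresses features into, so the sketch only shows that the middle stage is narrower than its neighbours.

```python
import random


def make_weights(rows, cols, rng):
    """Random dense matrix standing in for a learned layer (illustrative only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weights):
    """Apply the same linear map (in_dim x out_dim) to every token."""
    columns = list(zip(*weights))
    return [
        [sum(t * w for t, w in zip(token, col)) for col in columns]
        for token in tokens
    ]


class ProgressiveTokenizerSketch:
    """Hypothetical three-stage flow: Gen stage -> bottleneck -> Sem stage."""

    def __init__(self, patch_dim=16, width=32, latent_dim=8, seed=0):
        rng = random.Random(seed)
        self.gen = make_weights(patch_dim, width, rng)   # stand-in for Gen-ViT
        self.gsb = make_weights(width, latent_dim, rng)  # stand-in for the GSB
        self.sem = make_weights(latent_dim, width, rng)  # stand-in for Sem-ViT

    def encode(self, patches):
        primitives = project(patches, self.gen)  # low-level features
        latent = project(primitives, self.gsb)   # compressed representation
        semantic = project(latent, self.sem)     # features for understanding
        return latent, semantic


# Four dummy patch tokens of width 16.
patches = [[0.5] * 16 for _ in range(4)]
latent, semantic = ProgressiveTokenizerSketch().encode(patches)
print(len(latent), len(latent[0]), len(semantic[0]))  # prints: 4 8 32
```

The point of the sketch is the ordering: a single backbone produces the compact representation first and derives the semantic features from it, rather than running two decoupled encoders side by side.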

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning