HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization
Xuerui Qiu, Yutao Cui, Guozhen Zhang, Junzhe Li, JiaKui Hu, Xiao Zhang, Yang Li, Songtao Liu, Miles Yang, Yu Shi, Zhao Zhong, Liefeng Bo

TL;DR
HYDRA is a unified multimodal model that integrates visual understanding and generation through a representation-harmonized tokenization approach, and it reports state-of-the-art results on multiple benchmarks.
Contribution
The paper presents HYDRA, a novel framework that unifies perception and generation in a single model using a progressive ViT architecture with a Generation-Semantic Bottleneck for improved multimodal understanding and synthesis.
Findings
Sets new state-of-the-art in visual reconstruction and generation benchmarks.
Outperforms previous models by an average of 10 points on understanding benchmarks.
Achieves top-tier results on the GenEval, DPG-Bench, and WISE benchmarks.
Abstract
Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking a representation encoder atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we introduce HYDRA-TOK, a representation-harmonized pure ViT built on the insight that visual modeling should evolve from generation to understanding. HYDRA-TOK reformulates the standard backbone into a progressive learner that transitions from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding. Crucially, this transition is mediated by a Generation-Semantic Bottleneck (GSB), which compresses features into a…
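The staged layout described in the abstract (Gen-ViT, then the Generation-Semantic Bottleneck, then Sem-ViT) can be pictured with a toy sketch. This is not the authors' implementation: each stage is replaced by a single random linear map, and the class name `ToyProgressiveTokenizer`, the dimensions, and the choice of a narrower latent width for the bottleneck are all assumptions made for illustration. The abstract is cut off before it says what the GSB compresses features into, so the low-dimensional projection below is a placeholder only.

```python
import random


def random_matrix(rng, in_dim, out_dim):
    """Build an in_dim x out_dim matrix of small random weights (nested lists)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def linear(tokens, weight):
    """Apply the same weight matrix to every token vector."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


class ToyProgressiveTokenizer:
    """Hypothetical stand-in for the Gen-ViT -> GSB -> Sem-ViT ordering."""

    def __init__(self, dim=8, latent_dim=2, seed=0):
        rng = random.Random(seed)
        self.gen_w = random_matrix(rng, dim, dim)          # placeholder for Gen-ViT blocks
        self.gsb_w = random_matrix(rng, dim, latent_dim)   # placeholder for the bottleneck (assumed narrower)
        self.sem_w = random_matrix(rng, latent_dim, dim)   # placeholder for Sem-ViT blocks

    def forward(self, patch_tokens):
        gen_feats = linear(patch_tokens, self.gen_w)   # generation-oriented features
        latent = linear(gen_feats, self.gsb_w)         # compressed representation
        sem_feats = linear(latent, self.sem_w)         # semantic features built on the latent
        return latent, sem_feats


# Four toy "patch tokens" of width 8.
tokens = [[float(i + j) for j in range(8)] for i in range(4)]
model = ToyProgressiveTokenizer()
latent, sem = model.forward(tokens)
print(len(latent), len(latent[0]), len(sem), len(sem[0]))  # 4 2 4 8
```

The only point of the sketch is the ordering: semantic features are computed from the bottleneck output rather than from a separate encoder, which mirrors the abstract's claim that modeling proceeds from generation to understanding within one backbone.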
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
