VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Sinan Du; Jiahao Guo; Bo Li; Shuhao Cui; Zhengzhuo Xu; Yifu Luo; Yongxian Wei; Kun Gai; Xinggang Wang; Kai Wu; Chun Yuan

arXiv:2511.23386 · cs.CV · December 1, 2025

Open Access

TL;DR

VQRAE introduces a unified representation model combining continuous semantic features and discrete tokens for multimodal understanding, generation, and reconstruction, leveraging a novel high-dimensional vector quantization approach.

Contribution

The paper pioneers a unified tokenizer that produces both continuous and discrete representations within a single autoencoder framework, enabling multimodal tasks.

Findings

01. Achieves 100% codebook utilization at high dimensions.

02. Demonstrates competitive performance on visual understanding benchmarks.

03. Shows promising scaling in autoregressive generation tasks.
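To make the codebook utilization figure in the first finding concrete, here is a minimal sketch of nearest-neighbour vector quantization and one common way to measure utilization: the fraction of codebook entries selected at least once. It is an illustration only, not the paper's implementation. The function names, the squared-L2 distance, the toy 3-dimensional codebook, and this exact definition of utilization are assumptions made for the example.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(features, codebook):
    """Map each continuous feature vector to a discrete token id."""
    return [nearest_code(feature, codebook) for feature in features]


def codebook_utilization(token_ids, codebook_size):
    """Fraction of codebook entries that were selected at least once."""
    return len(set(token_ids)) / codebook_size


# Hypothetical toy codebook with 4 entries of dimension 3.
codebook = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, -1.0],
]

# Hypothetical continuous features, standing in for encoder outputs.
features = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.95, 0.0, 0.05],
]

token_ids = quantize(features, codebook)
utilization = codebook_utilization(token_ids, len(codebook))
print(token_ids)    # [0, 1, 0]
print(utilization)  # 0.5
```

In this toy run only two of the four entries are ever selected, so utilization is 50%. A reported 100% would mean every entry is selected at least once over the evaluation data.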

Abstract

Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Face recognition and analysis