Quantize-then-Rectify: Efficient VQ-VAE Training

Borui Zhang; Qihang Rao; Wenzhao Zheng; Jie Zhou; Jiwen Lu

arXiv:2507.10547·cs.CV·July 15, 2025

TL;DR

ReVQ introduces an efficient method to transform pre-trained VAEs into VQ-VAEs, drastically reducing training time while maintaining high-quality image reconstruction, enabling faster development of visual tokenizers for multimodal models.

Contribution

The paper presents ReVQ, a novel framework that leverages pre-trained VAEs with channel multi-group quantization and post rectification to enable rapid, low-cost VQ-VAE training.

Findings

1. ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06).

2. ReVQ reduces training time by over 100x compared to state-of-the-art methods.

3. ReVQ achieves competitive reconstruction quality with minimal computational resources.

Abstract

Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating channel multi-group quantization to enlarge codebook capacity and a post rectifier to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over…
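To make the codebook-capacity argument concrete, the following is a minimal sketch of multi-group quantization along the channel axis, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`nearest`, `multi_group_quantize`, `mse`), the random codebooks, and the sizes `C`, `G`, `K` are all hypothetical, and the learned post rectifier described in the abstract is not modeled. The idea shown is that splitting a latent into `G` channel groups, each with its own codebook of `K` entries, yields `K ** G` possible quantized vectors while storing only `G * K` codes.

```python
import random


def nearest(codebook, vec):
    """Return (index, code) of the entry closest to vec in squared L2 distance."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]


def multi_group_quantize(latent, codebooks):
    """Split the latent's channels into len(codebooks) equal groups and
    quantize each group against its own codebook.

    Returns (token_ids, quantized_latent)."""
    groups = len(codebooks)
    if len(latent) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    width = len(latent) // groups
    ids, quantized = [], []
    for g, codebook in enumerate(codebooks):
        chunk = latent[g * width:(g + 1) * width]
        idx, code = nearest(codebook, chunk)
        ids.append(idx)
        quantized.extend(code)
    return ids, quantized


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Illustrative sizes only (assumptions, not values from the paper).
C, G, K = 8, 4, 16
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0, 1) for _ in range(C // G)] for _ in range(K)]
    for _ in range(G)
]
latent = [rng.gauss(0, 1) for _ in range(C)]

ids, z_q = multi_group_quantize(latent, codebooks)
capacity = K ** G  # number of distinct quantized vectors representable
noise = mse(latent, z_q)  # quantization noise the decoder would have to tolerate
print(ids, capacity, noise)
```

In this sketch `noise` stands in for the quantization noise that, per the abstract, must stay within the pre-trained VAE's tolerance; in ReVQ the post rectifier is what reduces the remaining error, and that component would sit between `z_q` and the frozen decoder.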

Taxonomy

Topics: Iterative Learning Control Systems