Quantize-then-Rectify: Efficient VQ-VAE Training
Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu

TL;DR
ReVQ introduces an efficient method to transform pre-trained VAEs into VQ-VAEs, drastically reducing training time while maintaining high-quality image reconstruction, enabling faster development of visual tokenizers for multimodal models.
Contribution
The paper presents ReVQ, a novel framework that leverages pre-trained VAEs with channel multi-group quantization and post rectification to enable rapid, low-cost VQ-VAE training.
Findings
ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06).
ReVQ reduces training time by over 100x compared to state-of-the-art methods.
ReVQ achieves competitive reconstruction quality with minimal computational resources.
Abstract
Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating channel multi-group quantization to enlarge codebook capacity and a post rectifier to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over…
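To make the two components named in the abstract concrete, here is a minimal NumPy sketch under stated assumptions; it is not the authors' implementation. It assumes latents arrive as an (N, C) array from a frozen pre-trained VAE encoder, that the channels are split into G contiguous groups with one independent codebook of K entries per group (so each latent vector can take up to K^G distinct quantized values), and it uses a plain least-squares affine map as a stand-in for the post rectifier. The function names, shapes, random codebooks, and the affine rectifier are all hypothetical choices made for illustration.

```python
import numpy as np


def multi_group_quantize(z, codebooks):
    """Quantize each channel group of z with its own codebook.

    z:         (N, C) latent vectors (assumed to come from a frozen VAE encoder).
    codebooks: (G, K, C // G) array, one codebook of K entries per channel group.
    Returns token indices of shape (N, G) and quantized latents of shape (N, C).
    """
    G, K, d = codebooks.shape
    N, C = z.shape
    assert C == G * d, "channels must split evenly into G groups"
    groups = z.reshape(N, G, d)  # contiguous channel groups
    # Squared Euclidean distance from every group vector to every code: (N, G, K)
    dists = ((groups[:, :, None, :] - codebooks[None, :, :, :]) ** 2).sum(-1)
    idx = dists.argmin(-1)  # nearest code per group: (N, G)
    quantized = codebooks[np.arange(G)[None, :], idx]  # (N, G, d)
    return idx, quantized.reshape(N, C)


def fit_affine_rectifier(z_q, z):
    """Least-squares affine map from quantized latents back toward the originals.

    This is an illustrative stand-in (assumption), not the paper's rectifier.
    """
    A = np.hstack([z_q, np.ones((len(z_q), 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(A, z, rcond=None)
    return W


def rectify(z_q, W):
    """Apply the fitted affine rectifier to quantized latents."""
    A = np.hstack([z_q, np.ones((len(z_q), 1))])
    return A @ W


# Toy data: 256 latent vectors with 8 channels, 4 groups, 16 codes per group.
rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
codebooks = rng.normal(size=(4, 16, 2))  # random codes; a real codebook is learned

idx, z_q = multi_group_quantize(z, codebooks)
W = fit_affine_rectifier(z_q, z)
z_r = rectify(z_q, W)

err_quantized = float(np.mean((z_q - z) ** 2))
err_rectified = float(np.mean((z_r - z) ** 2))
print(f"MSE after quantization:  {err_quantized:.4f}")
print(f"MSE after rectification: {err_rectified:.4f}")
```

Because the identity map is itself an affine map, the least-squares fit cannot do worse than the raw quantized latents on the data it was fitted to. That guarantee holds only for this toy stand-in and says nothing about the behavior of the rectifier described in the paper.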
Taxonomy
Topics: Iterative Learning Control Systems
