Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl

TL;DR
This paper introduces Binary Spherical Quantization (BSQ) and a transformer-based image and video tokenizer built on it, reporting high compression efficiency, state-of-the-art reconstruction quality, and competitive image synthesis.
Contribution
It presents BSQ, a parameter-efficient, scalable, and compact quantization method that improves visual data compression and reconstruction in transformer models.
Findings
Achieves state-of-the-art image and video reconstruction quality.
Compresses visual data by up to 100 times with minimal distortion.
Enables competitive image synthesis comparable to GANs and diffusion models.
Abstract
We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100× with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4× throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with…
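The two steps the abstract names (projection onto a lower-dimensional hypersphere, then binary quantization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bsq_quantize`, the random projection weights, the dimensions `d` and `L`, the 1/sqrt(L) scaling of the code, and the convention of mapping zero to +1 are all assumptions made for illustration.

```python
import math
import random


def bsq_quantize(z, weight):
    """Illustrative sketch: project z to L dims, normalize, binarize.

    z      -- list of d floats (a visual embedding; hypothetical input)
    weight -- L rows of d floats (hypothetical projection matrix)
    """
    # Linear projection to the lower-dimensional space.
    v = [sum(w * x for w, x in zip(row, z)) for row in weight]

    # L2-normalize so that u lies on the unit hypersphere.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    u = [x / norm for x in v]

    # Binary quantization: keep only the sign of each coordinate.
    # Scaling by 1/sqrt(L) keeps the quantized code on the unit sphere
    # (an assumption of this sketch; zero is mapped to +1).
    n_bits = len(u)
    scale = 1.0 / math.sqrt(n_bits)
    code = [scale if x >= 0 else -scale for x in u]

    # The token index is the sign pattern read as an integer, so no
    # explicit codebook has to be stored.
    index = sum(1 << i for i, x in enumerate(u) if x >= 0)
    return u, code, index


random.seed(0)
d, L = 8, 4  # hypothetical embedding and bottleneck dimensions
weight = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
z = [random.gauss(0, 1) for _ in range(d)]
u, code, index = bsq_quantize(z, weight)
print(index, code)
```

In this sketch each token carries L bits, so the implicit codebook has 2^L entries without any of them being stored as parameters, which is one way to read the "parameter-efficient without an explicit codebook" claim.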
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper presents an innovative quantization method (BSQ) that addresses the limitations of existing vector quantization approaches by offering a more efficient and scalable solution. 2. Extensive experiments on benchmarks such as ImageNet and UCF-101 demonstrate that BSQ-ViT significantly improves reconstruction quality, outperforming prior methods in terms of speed and fidelity. 3. The methodology is clearly explained with detailed comparisons to related work, and the theoretical basis of
1. While the transformer architecture is explored, the paper does not demonstrate the effectiveness of BSQ within a CNN-based model. 2. The paper provides limited comparative data in video reconstruction, reducing the robustness of the comparison. Additionally, while block-wise causal attention is noted to impact performance, the study lacks experiments on BSQ without this causal masking. 3. The reported image and video compression results are better on MS-SSIM, potentially due to the inclusi
Binary Spherical Quantization appears to train the quantization bottleneck more effectively. The analysis shows that the proposed method is fast and performs well.
- Lack of comparison across different bitrate ranges for the visual compression results. Table 4 only provides BPP, PSNR and MS-SSIM for one bitrate point. However, visual compression tasks usually require showing Rate-Distortion curves and comparing at different bitrate points. You can use the BD-Rate metric for a more reasonable comparison and analyze the results at low and high bitrates. - Test settings for the ablation study. Please provide more details on the experiment settings. In Table 5, do VQ, LFQ and B
1. The idea of projecting high-dimensional visual embeddings onto a lower-dimensional hypersphere is straightforward yet effective. 2. The motivation is clear, and the overall presentation is coherent and easy to follow. The experiments are comprehensive and provide convincing evidence to support the approach. 3. The BSQ-ViT model achieves competitive performance in diverse tasks such as image/video reconstruction, generation, and compression.
1. This method uses a transformer encoder and decoder, which limits the flexibility in resolution. How do the authors address this issue? 2. For image and video compression results, it would be beneficial to include an LPIPS comparison to assess perceptual performance. 3. There are a few minor issues. 1) In Eq. (7), $\hat{q}(c|u) = \frac{\exp(\tau c^{T}u)}{\sum_{c \in C_{BSQ}} \exp(\tau c^{T}u)}$ might need to be revised to $\hat{q}(c|u) = \frac{\exp(2 \tau c^{T}u)}{\sum_{c \in C_{BSQ}} \exp(2
Taxonomy
Topics: Image Processing and 3D Reconstruction · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection
