Image and Video Tokenization with Binary Spherical Quantization

Yue Zhao; Yuanjun Xiong; Philipp Krähenbühl

arXiv:2406.07548·cs.CV·June 12, 2024

TL;DR

This paper introduces a transformer-based image and video tokenizer built on Binary Spherical Quantization (BSQ). It achieves high compression efficiency, state-of-the-art reconstruction quality, and competitive image synthesis.

Contribution

It presents BSQ, a quantization method that needs no explicit codebook (parameter-efficient), scales to arbitrary token dimensions, and yields compact codes, improving visual data compression and reconstruction in transformer-based tokenizers.

Findings

01

Achieves state-of-the-art image and video reconstruction quality.

02

Compresses visual data by up to 100 times with minimal distortion.

03

Enables image synthesis with quality comparable to GAN- and diffusion-based methods.

Abstract

We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100 $\times$ with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4 $\times$ throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with…
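The two steps the abstract names (project to a lower-dimensional hypersphere, then quantize each dimension to one bit) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the random projection matrix `proj`, the tie-breaking rule at zero, and the bit-packing order are all assumptions made for the example.

```python
import numpy as np

def bsq_quantize(z, proj):
    """Sketch of binary spherical quantization.

    z:    (N, D) array of embeddings.
    proj: (D, L) projection matrix (hypothetical stand-in for a learned layer).
    """
    v = z @ proj                                        # project D -> L dims
    u = v / np.linalg.norm(v, axis=-1, keepdims=True)   # onto the unit hypersphere
    L = u.shape[-1]
    bits = u >= 0                                       # one bit per dimension (tie rule assumed)
    u_hat = np.where(bits, 1.0, -1.0) / np.sqrt(L)      # hypercube corner, rescaled to unit norm
    return u, u_hat, bits

def bits_to_token(bits):
    """Pack L bits into one integer index; no codebook table is stored."""
    L = bits.shape[-1]
    weights = 1 << np.arange(L, dtype=np.int64)         # bit d has weight 2**d (order assumed)
    return (bits.astype(np.int64) * weights).sum(axis=-1)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))          # 4 embeddings of dimension 16
proj = rng.normal(size=(16, 8))       # project down to L = 8 bits per token
u, u_hat, bits = bsq_quantize(z, proj)
tokens = bits_to_token(bits)
```

Because the code for a token is just the sign pattern of `u`, the implicit codebook has 2^L entries while the quantizer itself adds no parameters, which is one way to read the "parameter-efficient without an explicit codebook" claim.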

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper presents an innovative quantization method (BSQ) that addresses the limitations of existing vector quantization approaches by offering a more efficient and scalable solution.
2. Extensive experiments on benchmarks such as ImageNet and UCF-101 demonstrate that BSQ-ViT significantly improves reconstruction quality, outperforming prior methods in terms of speed and fidelity.
3. The methodology is clearly explained with detailed comparisons to related work, and the theoretical basis of

Weaknesses

1. While the transformer architecture is explored, the paper does not demonstrate the effectiveness of BSQ within a CNN-based model.
2. The paper provides limited comparative data in video reconstruction, reducing the robustness of the comparison. Additionally, while block-wise causal attention is noted to impact performance, the study lacks experiments on BSQ without this causal masking.
3. The reported image and video compression results are better on MS-SSIM, potentially due to the inclusi

Reviewer 02 · Rating 6 · Confidence 5

Strengths

Binary Spherical Quantization appears to train the quantization bottleneck more effectively. The analysis shows that the proposed method is fast and performs well.

Weaknesses

- Lack of comparison across bitrate ranges for the visual compression results. Table 4 only provides BPP, PSNR, and MS-SSIM at a single bitrate point. However, visual compression tasks usually require showing rate-distortion curves and comparing at different bitrate points. You can use the BD-Rate metric for a more reasonable comparison and analyze the results at low and high bitrates.
- Test settings for the ablation study. Please provide more details on the experimental settings. In Table 5, do VQ, LFQ and B

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The idea of projecting high-dimensional visual embeddings onto a lower-dimensional hypersphere is straightforward yet effective.
2. The motivation is clear, and the overall presentation is coherent and easy to follow. The experiments are comprehensive and provide convincing evidence to support the approach.
3. The BSQ-ViT model achieves competitive performance in diverse tasks such as image/video reconstruction, generation, and compression.

Weaknesses

1. This method uses a transformer encoder and decoder, which limits the flexibility in resolution. How do the authors address this issue?
2. For image and video compression results, it would be beneficial to include an LPIPS comparison to assess perceptual performance.
3. There are a few minor issues. 1) In Eq. (7), $\hat{q}(c|u) = \frac{\exp(\tau c^{T}u)}{\sum_{c \in C_{BSQ}} \exp(\tau c^{T}u)}$ might need to be revised to $\hat{q}(c|u) = \frac{\exp(2 \tau c^{T}u)}{\sum_{c \in C_{BSQ}} \exp(2
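The remark on Eq. (7) is cut off here, so the reviewer's full argument is not recoverable. As background only, the sketch below checks numerically where a factor of 2 can arise: assuming $C_{BSQ}$ is the set of corners $\{\pm 1/\sqrt{L}\}^{L}$, a softmax with logits $\tau c^{T}u$ over all $2^{L}$ codes equals a product of per-dimension sigmoids $\sigma(2\tau c_d u_d)$. The function names and the choice of `L` and `tau` are illustrative assumptions, and the check does not settle which form the paper's equation should take.

```python
import itertools
import numpy as np

def softmax_over_codebook(u, tau):
    """Softmax with logits tau * c^T u over every corner c in {+-1/sqrt(L)}^L."""
    L = u.shape[0]
    codes = np.array(list(itertools.product([-1.0, 1.0], repeat=L))) / np.sqrt(L)
    logits = tau * (codes @ u)
    p = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return codes, p / p.sum()

def factorized_probability(u, codes, tau):
    """Product over dimensions of sigmoid(2 * tau * c_d * u_d)."""
    per_dim = 1.0 / (1.0 + np.exp(-2.0 * tau * codes * u))
    return per_dim.prod(axis=-1)

rng = np.random.default_rng(0)
u = rng.normal(size=4)
u = u / np.linalg.norm(u)               # a point on the unit hypersphere, L = 4
codes, p_joint = softmax_over_codebook(u, tau=3.0)
p_factored = factorized_probability(u, codes, tau=3.0)
```

The two arrays agree because $e^{a}/(e^{a}+e^{-a}) = \sigma(2a)$ for each dimension, so the joint distribution over $2^{L}$ codes can be evaluated with $L$ independent binary terms.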

Code & Models

Repositories

Models

GrayShine/WeTok
model · ♡ 2

Taxonomy

Topics: Image Processing and 3D Reconstruction · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection