Transformer-VQ: Linear-Time Transformers via Vector Quantization

Lucas D. Lingle

arXiv:2309.16354 · cs.LG · February 27, 2024 · 2 cites

TL;DR

Transformer-VQ is a decoder-only transformer that computes softmax-based dense self-attention in linear time, using vector-quantized keys and a caching mechanism. It is competitive in quality on large-scale benchmarks and over 3x faster than a comparable quadratic-time transformer at sequence length 8k.

Contribution

It presents a novel linear-time transformer architecture with vector-quantized keys and caching, enabling efficient processing of very long sequences.
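As a rough illustration of what "vector-quantized keys" means, the sketch below snaps each key vector to its nearest row in a small codebook. The nearest-neighbor rule under squared Euclidean distance, the helper names (`sq_dist`, `nearest_code`, `quantize_keys`), and the sizes are assumptions chosen for illustration; the summary does not describe how the paper trains or applies its codebook.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_code(vec, codebook):
    """Index of the codebook row closest to `vec` (assumed quantization rule)."""
    return min(range(len(codebook)), key=lambda c: sq_dist(vec, codebook[c]))

def quantize_keys(keys, codebook):
    """Return the code index and the quantized vector for every key."""
    codes = [nearest_code(k, codebook) for k in keys]
    return codes, [codebook[c] for c in codes]

# Hypothetical sizes: 8 codes, 32 keys, 4 dimensions.
rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
keys = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]

codes, quantized = quantize_keys(keys, codebook)
print(len(set(codes)), "distinct codes used for", len(keys), "keys")
```

After quantization, every key is one of at most 8 distinct vectors, regardless of how long the sequence is. That bounded set is what an efficient attention scheme can exploit.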

Findings

01

Achieves 0.99 bpb on Enwik8 and 26.6 ppl on PG-19.

02

Over 3x faster than quadratic transformers at sequence length 8k.

03

Scales efficiently to sequence length 131k with similar throughput.
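The reported speedups can be sanity-checked with a simple cost model. The sketch below assumes attention cost grows as T^2 for the quadratic baseline and as T for the linear-time model, so the speedup ratio grows in proportion to T. It also assumes "8k" and "32k" mean 8192 and 32768 tokens. The function name `projected_speedup` is hypothetical, and the model ignores every cost other than attention.

```python
def projected_speedup(base_speedup, base_len, new_len):
    """Extrapolate a linear-vs-quadratic speedup from one measured point.

    Idealized assumption: quadratic cost ~ T**2 and linear cost ~ T,
    so the ratio of the two scales linearly with sequence length T.
    """
    return base_speedup * new_len / base_len

# Starting from roughly 3x at 8k, the model predicts roughly 12x at 32k.
print(projected_speedup(3.0, 8192, 32768))  # 12.0
```

This idealized projection matches the abstract's figures of over 3x at 8k and over 12x at 32k, which is what one would expect if attention dominates the baseline's running time at these lengths.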

Abstract

We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown to be highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code available: https://github.com/transformer-vq/transformer_vq
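One way to see why quantized keys can make softmax attention cheaper is that keys sharing a code receive identical attention scores, so their values can be summed once and reused. The sketch below checks this identity numerically for a single query. It is a simplified, non-causal illustration with hypothetical names and sizes, not the paper's algorithm: it omits causal masking, the caching mechanism, and numerical stabilization of the exponentials.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_attention(q, quantized_keys, values):
    """Softmax attention for one query over all T quantized keys: O(T) work."""
    weights = [math.exp(dot(q, k)) for k in quantized_keys]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]

def codebook_attention(q, codebook, codes, values):
    """Same output from per-code aggregates: O(S) work per query once built."""
    num_codes, dim = len(codebook), len(values[0])
    counts = [0] * num_codes
    value_sums = [[0.0] * dim for _ in range(num_codes)]
    for c, v in zip(codes, values):  # single pass, reusable for every query
        counts[c] += 1
        for d in range(dim):
            value_sums[c][d] += v[d]
    scores = [math.exp(dot(q, code)) for code in codebook]
    z = sum(s * n for s, n in zip(scores, counts))
    return [sum(s * vs[d] for s, vs in zip(scores, value_sums)) / z for d in range(dim)]

# Hypothetical sizes: S = 4 codes, T = 64 positions, 3 dimensions.
rng = random.Random(0)
S, T, dim = 4, 64, 3
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(S)]
codes = [rng.randrange(S) for _ in range(T)]  # code index of each quantized key
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(T)]
query = [rng.gauss(0, 1) for _ in range(dim)]

dense_out = dense_attention(query, [codebook[c] for c in codes], values)
fast_out = codebook_attention(query, codebook, codes, values)
max_gap = max(abs(a - b) for a, b in zip(dense_out, fast_out))
print("max difference:", max_gap)
```

The two functions agree up to floating-point rounding. The per-code counts and value sums depend only on the sequence, not on the query, so with a fixed codebook size the per-query cost no longer grows with sequence length.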

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

* The idea of using vector quantization to achieve linear attention sounds reasonable.
* The paper includes a detailed discussion of various related works and how they differ from Transformer-VQ.
* Code is provided, aiding reproducibility.
* The paper provides qualitative samples of the various trained models.

Weaknesses

* I think the presentation of the paper could be improved. Some pseudocode or diagrams illustrating the main ideas would be useful, while some of the theorems could potentially be moved to the Appendix.
* It seems to me that Transformer-VQ only significantly outperforms prior work on ImageNet64, where it uses a 7x larger model than the second-best model.
* It is also not entirely clear to me what the real-world advantage of Transformer-VQ is, I assume that the attention mechanism is signific

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

* Practicality: The instructions on how to implement the architecture are very detailed, and it looks like the authors experimented quite a lot to design an architecture that works well with various architecture sizes (190M, 1.2B, 1.3B parameters).
* Large-scale experiments across various challenging tasks.

Weaknesses

The authors experiment with various architecture sizes; how did they choose which parameters to scale up? Are there any "scaling laws" that the authors observed to be useful? For example, fixing the codebook size to 256 may restrict the model's expressiveness, which can be extremely critical for generative models. Given that the authors are emphasizing a decoder model, which is particularly useful for generative tasks, I believe this paper would benefit from analysis on this new parameter's

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

* While the idea of adopting VQ into Transformers is not entirely new (e.g., Clustered Attention), the paper’s innovation is in using VQ for keys and demonstrating its effectiveness (especially for decoders). This is a nice contribution.
* The paper provides a comprehensive review of related works, addressing various aspects. Furthermore, the equations and notations are presented clearly to help clarify the core idea.

Weaknesses

* Although this paper focuses on efficient computation, there is no direct comparison of FLOPs or actual inference latency on GPUs/TPUs with previous works. It would also be beneficial to include a comparison between models with and without VQ.
* Only one model size (1.3B for PG-19, 1.2B for ImageNet64) is used for experiments. To demonstrate the effectiveness, it would be valuable to test with multiple (smaller) model sizes where the sequence length exceeds the hidden dimension of Transformers.

Code & Models

Repositories

transformer-vq/transformer_vq
JAX · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications · Image Retrieval and Classification Techniques