Transformer-VQ: Linear-Time Transformers via Vector Quantization
Lucas D. Lingle

TL;DR
Transformer-VQ introduces a linear-time decoder-only transformer using vector quantization and caching, achieving high-quality results and significant speedups over traditional quadratic transformers on large-scale tasks.
Contribution
It presents a novel linear-time transformer architecture with vector-quantized keys and caching, enabling efficient processing of very long sequences.
Findings
Achieves 0.99 bpb on Enwik8 and 26.6 ppl on PG-19.
Over 3x faster than quadratic transformers at sequence length 8k.
Scales efficiently to sequence length 131k with similar throughput.
Abstract
We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown to be highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code available: https://github.com/transformer-vq/transformer_vq
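To illustrate why vector-quantized keys can make softmax attention cheaper, the sketch below checks one identity: if every key is replaced by its nearest codeword, the attention scores only need to be evaluated against the S codewords, and the values can be pre-summed per codeword. This is a simplified, non-causal toy in pure Python, not the authors' implementation; it omits the paper's caching mechanism and causal masking, and all names (`quantize`, `dense_attention`, `vq_attention`) and the sizes T, S, d are hypothetical choices made for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantize(key, codebook):
    """Return the index of the nearest codeword (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda s: sum((k - c) ** 2 for k, c in zip(key, codebook[s])))

def dense_attention(queries, keys, values):
    """Reference non-causal softmax attention: T*T score evaluations."""
    d_v = len(values[0])
    out = []
    for q in queries:
        w = [math.exp(dot(q, k)) for k in keys]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) / z
                    for j in range(d_v)])
    return out

def vq_attention(queries, codes, codebook, values):
    """Same output when each key equals its codeword: T*S score evaluations."""
    S, d_v = len(codebook), len(values[0])
    # One pass over the sequence: per-codeword counts and value sums.
    counts = [0] * S
    value_sums = [[0.0] * d_v for _ in range(S)]
    for s, v in zip(codes, values):
        counts[s] += 1
        for j in range(d_v):
            value_sums[s][j] += v[j]
    out = []
    for q in queries:
        # Scores against the S codewords only, not against all T keys.
        w = [math.exp(dot(q, c)) for c in codebook]
        z = sum(wi * n for wi, n in zip(w, counts))
        out.append([sum(wi * vs[j] for wi, vs in zip(w, value_sums)) / z
                    for j in range(d_v)])
    return out

# Toy sizes (assumptions for illustration only).
random.seed(0)
T, S, d = 64, 4, 8

def rand_vec(n):
    return [random.gauss(0, 0.5) for _ in range(n)]

codebook = [rand_vec(d) for _ in range(S)]
queries = [rand_vec(d) for _ in range(T)]
raw_keys = [rand_vec(d) for _ in range(T)]
values = [rand_vec(d) for _ in range(T)]

codes = [quantize(k, codebook) for k in raw_keys]
quantized_keys = [codebook[s] for s in codes]

ref = dense_attention(queries, quantized_keys, values)
fast = vq_attention(queries, codes, codebook, values)
max_err = max(abs(a - b) for r, f in zip(ref, fast) for a, b in zip(r, f))
print("max abs difference:", max_err)
```

The two functions agree up to floating-point rounding because exp(q·k) takes at most S distinct values per query once keys are quantized. For a fixed codebook size, the grouped version does work proportional to T rather than T squared, which is the intuition behind the linear-time claim in the abstract.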
Peer Reviews
Decision: ICLR 2024 poster
* The idea of using vector quantization to achieve linear attention sounds reasonable.
* The paper includes a detailed discussion of various related works and how they differ from Transformer-VQ.
* Code is provided, aiding reproducibility.
* The paper provides qualitative samples of the various trained models.
* I think the presentation of the paper could be improved. Some pseudocode or diagrams illustrating the main ideas would be useful, while some of the theorems could potentially be moved to the Appendix.
* It seems to me that Transformer-VQ only significantly outperforms prior work on ImageNet64, where it uses a 7x larger model than the second-best model.
* It is also not entirely clear to me what the real-world advantage of Transformer-VQ is, I assume that the attention mechanism is signific
* Practicality: The instructions on how to implement the architecture are very detailed, and it looks like the authors experimented quite a lot to design an architecture that works well with various architecture sizes (190M, 1.2B, 1.3B parameters).
* Large-scale experiments across various challenging tasks.
The authors experiment with various architecture sizes; how did the authors choose which parameters to scale up? Are there any "scaling laws" that the authors observed to be useful? For example, fixing the codebook size to 256 may restrict the model's expressiveness, which can be extremely critical for generative models. Given that the authors are emphasizing a decoder model which is particularly useful for generative tasks, I believe this paper would benefit from analysis on this new parameter's
* While the idea of adopting VQ into Transformers is not entirely new (e.g., Clustered Attention), the paper’s innovation is in using VQ for Keys and demonstrating its effectiveness (especially for Decoder). This is a nice contribution.
* The paper provides a comprehensive review of related works, addressing various aspects. Furthermore, the equations and notations are presented clearly to help clarify the core idea.
* Although this paper focuses on efficient computation, there is no direct comparison of FLOPs or actual inference latency on GPUs/TPUs with previous works. It would also be beneficial to include a comparison between models with and without VQ.
* Only one model size (1.3B for PG-19, 1.2B for ImageNet64) is used for experiments. To demonstrate the effectiveness, it would be valuable to test with multiple (smaller) model sizes where the sequence length exceeds the hidden dimension of Transformers.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications · Image Retrieval and Classification Techniques
