Linearizing Transformer with Key-Value Memory
Yizhe Zhang, Deng Cai

TL;DR
MemSizer is a novel transformer variant that combines low-rank projection and kernel-based methods to achieve linear time complexity, constant memory, and improved performance on sequence generation tasks.
Contribution
The paper introduces MemSizer, which aims to improve both the efficiency and the accuracy of transformers by combining low-dimensional projections of the source sequence with recurrent-style incremental computation.
Findings
MemSizer achieves linear inference time and constant memory.
It is reported to offer a better balance of efficiency and accuracy than the vanilla transformer and other efficient transformer variants on translation, summarization, and language modeling.
MemSizer retains its efficiency gains even when the generated sequence is short, a setting where other efficient transformers often fail to obtain any computation gain.
Abstract
Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based transformers. Despite their unique merits, they usually suffer a performance drop compared with the vanilla transformer on many sequence generation tasks, and often fail to obtain a computation gain when the generation is short. We propose MemSizer, an approach towards closing the performance gap while improving efficiency even with short generation. It projects the source sequences into lower-dimensional representations like Linformer, while enjoying efficient recurrent-style incremental computation similar to kernel-based transformers. This yields linear computation time and constant memory complexity at inference time. MemSizer also employs a…
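To make "recurrent-style incremental computation" concrete, here is a minimal sketch of generic kernel-based linear attention, the family of methods the abstract compares against. It is not MemSizer's own mechanism, which the truncated abstract does not spell out. The feature map (elu(x) + 1), the class name LinearAttentionState, and the helper causal_attention_reference are illustrative assumptions. The point is that each decoding step updates a fixed-size state instead of revisiting all previous tokens, which gives constant memory and linear total time.

```python
import math


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in kernel-based
    # linear attention (an assumption here, not taken from the paper).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionState:
    """Fixed-size recurrent state: S is d_k x d_v, z has length d_k."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def step(self, q, k, v):
        """Consume one (query, key, value) triple and return the output."""
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Accumulate phi(k) v^T into S and phi(k) into z.
        for i, ki in enumerate(phi_k):
            self.z[i] += ki
            for j, vj in enumerate(v):
                self.S[i][j] += ki * vj
        # Output = phi(q)^T S / phi(q)^T z; the cost is independent of
        # how many tokens have been processed so far.
        denom = sum(qi * zi for qi, zi in zip(phi_q, self.z))
        return [
            sum(phi_q[i] * self.S[i][j] for i in range(len(phi_q))) / denom
            for j in range(len(v))
        ]


def causal_attention_reference(qs, ks, vs):
    """Quadratic-time version of the same attention, for comparison."""
    outputs = []
    for t, q in enumerate(qs):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(ks[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append([
            sum(w * vs[s][j] for s, w in enumerate(weights)) / total
            for j in range(len(vs[0]))
        ])
    return outputs
```

Calling step once per generated token produces the same outputs as the quadratic reference, while the state keeps the size d_k x d_v no matter how long the sequence grows.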
Taxonomy
Topics: Neural Networks and Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Multi-Head Linear Attention · Dense Connections · Residual Connection · Softmax · Layer Normalization · Linformer
