Improving Transformers with Probabilistic Attention Keys
Tam Nguyen, Tan M. Nguyen, Dung D. Le, Duy Khuong Nguyen, Viet-Anh Tran, Richard G. Baraniuk, Nhat Ho, Stanley J. Osher

TL;DR
This paper introduces Transformer-MGK, a transformer architecture that replaces redundant attention heads with a mixture of Gaussian keys at each head. The result trains and runs faster, with fewer parameters and FLOPs, at comparable or better accuracy.
Contribution
The paper proposes Transformer-MGK, a new transformer design using Gaussian mixture keys to reduce redundancy and improve efficiency without sacrificing performance.
Findings
Transformer-MGK achieves comparable or better accuracy than baseline transformers.
It accelerates training and inference while reducing parameters and FLOPs.
It is effective on language modeling and long-sequence tasks.
Abstract
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to…
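The idea of a mixture of keys per head can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each key position j holds M candidate keys with mixing weights pi[j] (positive, summing to 1), that every component is an isotropic Gaussian with a shared scale sigma, and that the attention weight from query i to position j is the normalized mixture likelihood of the query. The name `mgk_attention` and all argument names are hypothetical.

```python
import numpy as np

def mgk_attention(Q, K, V, pi, sigma=1.0):
    """Single-head attention with a mixture of Gaussian keys (illustrative sketch).

    Q:  (n, d)     queries
    K:  (m, M, d)  M candidate keys per key position (assumed layout)
    V:  (m, dv)    values
    pi: (m, M)     mixing weights, strictly positive, each row sums to 1
    """
    # Squared distance between every query and every candidate key: (n, m, M)
    diff = Q[:, None, None, :] - K[None, :, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Log of each weighted Gaussian kernel (constant factors cancel later)
    log_comp = np.log(pi)[None, :, :] - sq_dist / (2.0 * sigma ** 2)

    # Log-sum-exp over the M components gives the mixture log-likelihood: (n, m)
    mx = log_comp.max(axis=-1, keepdims=True)
    log_lik = mx[..., 0] + np.log(np.exp(log_comp - mx).sum(axis=-1))

    # Normalize over key positions so each query's weights sum to 1
    log_lik = log_lik - log_lik.max(axis=1, keepdims=True)
    A = np.exp(log_lik)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V, A

# Toy usage with random inputs (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 3, 8))
V = rng.normal(size=(6, 5))
pi = rng.dirichlet(np.ones(3), size=6)
out, A = mgk_attention(Q, K, V, pi)
```

With M = 1 the sketch reduces to a softmax over negative scaled squared distances, which matches dot-product softmax attention up to a per-query constant when the keys have equal norm.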
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Adam · Label Smoothing · Byte Pair Encoding
