Improving Transformers with Probabilistic Attention Keys

Tam Nguyen; Tan M. Nguyen; Dung D. Le; Duy Khuong Nguyen; Viet-Anh Tran; Richard G. Baraniuk; Nhat Ho; Stanley J. Osher

arXiv:2110.08678·cs.LG·June 14, 2022·6 cites

TL;DR

This paper introduces Transformer-MGK, a transformer architecture that replaces redundant attention heads with a mixture of Gaussian keys at each head. Compared with a conventional transformer, it trains and runs inference faster, uses fewer parameters and FLOPs, and reaches comparable or better accuracy.

Contribution

The paper proposes Transformer-MGK, a new transformer design using Gaussian mixture keys to reduce redundancy and improve efficiency without sacrificing performance.

Findings

01. Transformer-MGK achieves comparable or better accuracy than baseline transformers.

02. It accelerates training and inference while reducing parameters and FLOPs.

03. It is effective on language modeling and long-sequence tasks.

Abstract

Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to…
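The idea of a mixture of keys can be illustrated with a small sketch. This is not the authors' implementation. It is one plausible reading of the abstract, assuming that each position holds several keys, that a query is scored against a position by a weighted sum of isotropic Gaussian kernels centered at those keys, and that scores are normalized across positions. The names `gaussian_kernel`, `mgk_attention`, `mix_weights`, and `sigma` are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kernel(q, k, sigma):
    """Unnormalized isotropic Gaussian: exp(-||q - k||^2 / (2 * sigma^2))."""
    sq_dist = sum((qi - ki) ** 2 for qi, ki in zip(q, k))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mgk_attention(queries, keys, values, mix_weights, sigma=1.0):
    """Single-head attention with a mixture of keys per position (illustrative).

    queries:     list of N query vectors
    keys:        list of M positions, each a list of R key vectors
    values:      list of M value vectors
    mix_weights: list of M lists of R non-negative weights (assumed to sum to 1)
    sigma:       shared kernel width (an assumption of this sketch)
    """
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Score for position j: weighted sum of Gaussian kernels around its R keys.
        scores = [
            sum(w * gaussian_kernel(q, k, sigma)
                for w, k in zip(mix_weights[j], keys[j]))
            for j in range(len(keys))
        ]
        # Normalize across positions so the attention weights sum to 1.
        # (No underflow guard here; very distant keys could make total 0.0.)
        total = sum(scores)
        attn = [s / total for s in scores]
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]
        )
    return outputs


# Two positions, two keys each; the query is equidistant from every key.
queries = [[0.0, 0.0]]
keys = [[[1.0, 0.0], [-1.0, 0.0]], [[0.0, 1.0], [0.0, -1.0]]]
values = [[1.0, 0.0], [0.0, 1.0]]
mix_weights = [[0.5, 0.5], [0.5, 0.5]]
result = mgk_attention(queries, keys, values, mix_weights)
```

Because the query in this toy input is the same distance from all four keys, both positions receive equal attention and the output is the plain average of the two value vectors. With one key per position (R = 1) and keys of equal norm, the same computation reduces to softmax attention over dot products scaled by `1 / sigma**2`, which is why a mixture can be read as a generalization of the usual single-key head.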

Code & Models

Repositories

minhtannguyen/transformer-mgk (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Adam · Label Smoothing · Byte Pair Encoding