Leveraging redundancy in attention with Reuse Transformers

Srinadh Bhojanapalli; Ayan Chakrabarti; Andreas Veit; Michal Lukasik; Himanshu Jain; Frederick Liu; Yin-Wen Chang; Sanjiv Kumar

arXiv:2110.06821 · cs.LG · October 14, 2021

TL;DR

This paper introduces a novel Transformer architecture that reuses attention scores across layers, reducing computation and memory while maintaining or improving performance.

Contribution

It systematically analyzes attention score redundancy and proposes reusing scores across layers to enhance efficiency without sacrificing accuracy.

Findings

01

Attention scores are considerably redundant across heads and layers, with adjacent layers showing especially high similarity.

02

Reusing attention scores maintains or improves performance.

03

Reusing attention scores reduces both compute and memory usage.
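
The redundancy in the first finding can be illustrated by comparing the attention distributions that two layers (or heads) assign to the same sequence. The sketch below is a minimal, standard-library illustration, not the authors' analysis code. The similarity measure (one minus the total variation distance between matching rows, averaged over query tokens) is an assumption chosen for illustration, since this page does not state the metric used; the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_probs(queries, keys):
    """Scaled dot-product attention probabilities: one distribution per query token."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]

def attention_similarity(a, b):
    """Assumed metric: mean over rows of (1 - total variation distance).

    Returns 1.0 for identical attention maps and 0.0 when every pair of
    rows places its mass on disjoint tokens.
    """
    per_row = [
        1.0 - 0.5 * sum(abs(x - y) for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(a, b)
    ]
    return sum(per_row) / len(per_row)

# Toy example: two "layers" whose queries differ only slightly.
keys = [[1.0, 0.0], [0.0, 1.0]]
layer1 = attention_probs([[2.0, 0.0], [0.0, 2.0]], keys)
layer2 = attention_probs([[2.1, 0.0], [0.0, 1.9]], keys)
similarity = attention_similarity(layer1, layer2)  # close to, but below, 1.0
```

A value near 1.0 for a pair of layers would suggest that recomputing the scores in the second layer adds little, which is the observation that motivates reuse.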

Abstract

Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision. However, a typical Transformer model computes such pairwise attention scores repeatedly for the same sequence, in multiple heads in multiple layers. We systematically analyze the empirical similarity of these scores across heads and layers and find them to be considerably redundant, with adjacent layers showing especially high similarity. Motivated by these findings, we propose a novel architecture that reuses attention scores computed in one layer in multiple subsequent layers. Experiments on a number of standard benchmarks show that reusing attention delivers performance equivalent to or better than standard transformers, while reducing both compute and memory usage.
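
The reuse idea described in the abstract can be sketched as a layer stack in which some layers skip the query/key computation and apply attention probabilities cached from an earlier layer. This is a simplified, single-head, standard-library sketch under stated assumptions: the layer format (dicts with `w_q`, `w_k`, `w_v`, and a `reuse` flag), the attention-only residual layers, and the function names are all hypothetical, and it does not reproduce the authors' architecture or the API of the repository listed on this page.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_probs(x, w_q, w_k):
    """Compute softmax(Q K^T / sqrt(d)) for token matrix x (tokens x dim)."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    k_t = [list(col) for col in zip(*k)]  # transpose K to (dim x tokens)
    scale = 1.0 / math.sqrt(len(q[0]))
    return [softmax([scale * s for s in row]) for row in matmul(q, k_t)]

def run_stack(x, layers):
    """Run attention-only residual layers, reusing cached probabilities on request.

    Returns the output tokens and how many times attention scores were computed.
    """
    probs, n_computed = None, 0
    for layer in layers:
        if not layer.get("reuse") or probs is None:
            # Standard layer: compute fresh scores from this layer's Q/K weights.
            probs = attention_probs(x, layer["w_q"], layer["w_k"])
            n_computed += 1
        # Every layer, reuse or not, still applies its own value projection.
        attn_out = matmul(probs, matmul(x, layer["w_v"]))
        x = [[xi + oi for xi, oi in zip(xr, orow)] for xr, orow in zip(x, attn_out)]
    return x, n_computed

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
stack = [
    {"w_q": identity, "w_k": identity, "w_v": identity},  # computes scores
    {"reuse": True, "w_v": identity},                     # reuses them
    {"reuse": True, "w_v": identity},                     # reuses them
]
output, score_computations = run_stack(tokens, stack)
```

In this toy stack the scores are computed once for three layers, and the reuse layers carry no query or key weights at all, which is where savings in both compute and memory would come from in such a design.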

Code & Models

Repositories

tensorflow/models/blob/master/official/nlp/modeling/layers/reuse_transformer.py (TensorFlow)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Label Smoothing · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Softmax · Dropout