Cascaded Head-colliding Attention

Lin Zheng; Zhiyong Wu; Lingpeng Kong

arXiv:2105.14850 · cs.CL · June 1, 2021 · 1 cites

TL;DR

This paper introduces CODA, a hierarchical variational model that explicitly captures interactions among attention heads in Transformers, leading to improved parameter efficiency and better performance on language modeling and translation tasks.

Contribution

It reformulates multi-head attention as a probabilistic latent variable model and explicitly models the interactions among heads, with the aim of improving the parameter efficiency of Transformers.

Findings

01 Outperforms baseline by 0.6 perplexity on Wikitext-103

02 Achieves 0.6 BLEU improvement on WMT14 EN-DE

03 Demonstrates enhanced parameter efficiency in Transformer models

Abstract

Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. At the cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism which models pairwise interactions between the elements of the sequence. Despite its massive success, the current framework ignores interactions among different heads, leading to the problem that many of the heads are redundant in practice, which greatly wastes the capacity of the model. To improve parameter efficiency, we re-formulate the MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline, by $0.6$ perplexity on…
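
For context, the baseline multi-head attention that the abstract refers to can be sketched as follows. This is a minimal pure-Python illustration of standard scaled dot-product attention with several heads, not an implementation of CODA. The helper names, the toy dimensions, and the random weights are assumptions made for the example, and the final output projection is omitted for brevity. Each head computes its attention weights without reference to any other head, which is the independence the abstract describes.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # Multiply an (n x k) matrix by a (k x m) matrix, both as lists of rows.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention_head(x, wq, wk, wv):
    # One head: project the sequence, score all pairs of positions, mix values.
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_head = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return weights, matmul(weights, v)


def multi_head_attention(x, heads):
    # Heads are evaluated one by one; no head sees another head's weights.
    all_weights, outputs = [], []
    for wq, wk, wv in heads:
        w, o = attention_head(x, wq, wk, wv)
        all_weights.append(w)
        outputs.append(o)
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((o[i] for o in outputs), []) for i in range(len(x))]
    return all_weights, concat


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen only for illustration.
rng = random.Random(0)
seq_len, d_model, n_heads, d_head = 4, 8, 2, 4
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
weights, out = multi_head_attention(x, heads)
```

Here `weights` holds one attention matrix of size `seq_len` x `seq_len` per head, and `out` has `n_heads * d_head` features per position. Per the abstract, CODA replaces this independent treatment with a hierarchical variational distribution over the heads; that part is not shown here.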

Code & Models

Repositories

LZhengisme/CODA (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Byte Pair Encoding · Residual Connection · Dropout