Cascaded Head-colliding Attention
Lin Zheng, Zhiyong Wu, Lingpeng Kong

TL;DR
This paper introduces CODA, a hierarchical variational model that explicitly captures interactions among attention heads in Transformers, leading to improved parameter efficiency and better performance on language modeling and translation tasks.
Contribution
The paper reformulates multi-head attention as a probabilistic latent variable model and explicitly models interactions among attention heads through a hierarchical variational distribution, with the aim of improving the parameter efficiency of Transformers.
Findings
Outperforms the Transformer baseline by 0.6 perplexity on Wikitext-103 (language modeling)
Improves over the Transformer baseline by 0.6 BLEU on WMT14 EN-DE (machine translation)
Demonstrates enhanced parameter efficiency in Transformer models
Abstract
Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. The cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of the sequence. Despite its massive success, the current framework ignores interactions among different heads, so many of the heads are redundant in practice, which greatly wastes the capacity of the model. To improve parameter efficiency, we re-formulate MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA), which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the Transformer baseline by 0.6 perplexity on…
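For context, here is a minimal NumPy sketch of standard multi-head attention, the baseline the abstract refers to. It is not CODA and not the authors' code. The function and variable names (`multi_head_attention`, `w_q`, `w_k`, `w_v`, `w_o`) and the toy dimensions are assumptions chosen for illustration. The sketch shows that each head computes its attention weights from its own slice of the projections, with no interaction between heads until the final output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Standard (non-CODA) multi-head self-attention.

    x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model).
    Returns the output (seq_len, d_model) and the per-head attention
    weights (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Each head scores pairwise interactions between sequence elements
    # using only its own query/key slice.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)

    heads = attn @ v  # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, attn

# Toy inputs (illustrative sizes only).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, attn = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)

# Perturb only head 0's query projection: the other heads' attention
# weights do not change, because heads are computed independently.
w_q_perturbed = w_q.copy()
w_q_perturbed[:, : d_model // num_heads] += 1.0
_, attn_perturbed = multi_head_attention(x, w_q_perturbed, w_k, w_v, w_o, num_heads)
other_heads_unchanged = np.allclose(attn[1:], attn_perturbed[1:])
```

In this sketch, perturbing one head's parameters leaves every other head's attention weights untouched. That independence is what the abstract describes when it says the standard framework ignores interactions among heads.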
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Byte Pair Encoding · Residual Connection · Dropout
