Compositional Attention: Disentangling Search and Retrieval
Sarthak Mittal, Sharath Chandra Raparthy, Irina Rish, Yoshua Bengio, and Guillaume Lajoie

TL;DR
This paper introduces Compositional Attention, a novel mechanism that separates search and retrieval in attention heads, enhancing flexibility, reducing redundancy, and improving performance on various tasks, including out-of-distribution scenarios.
Contribution
It proposes a new attention mechanism that disentangles search and retrieval, allowing dynamic composition and better generalization compared to standard multi-head attention.
Findings
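Outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings
Leads to dynamic specialization based on the type of retrieval needed
Generalizes multi-head attention and allows the number of searches and the number of retrievals to be scaled independently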
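The idea of composing searches and retrievals dynamically can be illustrated with a small NumPy sketch. This is a sketch under stated assumptions, not the authors' implementation: the function name compositional_attention_sketch, the retrieval-query projection w_rq, and the retrieval-key projection w_rk are hypothetical names introduced here, and the output projection, masking, and batching are omitted. It uses S searches and R retrievals, pairs every search with every retrieval, and lets a softmax over the R candidates decide, per search and per position, which retrieval to use.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compositional_attention_sketch(x, w_q, w_k, w_v, w_rq, w_rk):
    """Illustrative sketch only; all parameter names are assumptions.

    x:    (n, d)          input tokens
    w_q:  (S, d, d_head)  query projections, one per search
    w_k:  (S, d, d_head)  key projections, one per search
    w_v:  (R, d, d_head)  value projections, one per retrieval
    w_rq: (S, d, d_head)  hypothetical retrieval-query projections
    w_rk: (d_head, d_head) hypothetical retrieval-key projection
    """
    n = x.shape[0]
    q = np.einsum("nd,sde->sne", x, w_q)   # (S, n, d_head)
    k = np.einsum("nd,sde->sne", x, w_k)   # (S, n, d_head)
    v = np.einsum("nd,rde->rne", x, w_v)   # (R, n, d_head)
    d_head = q.shape[-1]

    # Search: S attention maps from query-key interactions.
    attn = softmax(np.einsum("sne,sme->snm", q, k) / np.sqrt(d_head))

    # Candidate retrievals: every search is paired with every value set.
    cand = np.einsum("snm,rme->snre", attn, v)   # (S, n, R, d_head)

    # Soft selection over the R candidates, per search and per position.
    rq = np.einsum("nd,sde->sne", x, w_rq)       # (S, n, d_head)
    rk = cand @ w_rk                             # (S, n, R, d_head)
    r_weights = softmax(
        np.einsum("sne,snre->snr", rq, rk) / np.sqrt(d_head)
    )                                            # (S, n, R)

    out = np.einsum("snr,snre->sne", r_weights, cand)  # (S, n, d_head)
    # Concatenate the S search outputs for each position.
    return out.transpose(1, 0, 2).reshape(n, -1), r_weights

rng = np.random.default_rng(0)
n, d, d_head, S, R = 5, 16, 8, 3, 2   # S and R are chosen independently
x = rng.normal(size=(n, d))
w_q = rng.normal(size=(S, d, d_head))
w_k = rng.normal(size=(S, d, d_head))
w_v = rng.normal(size=(R, d, d_head))
w_rq = rng.normal(size=(S, d, d_head))
w_rk = rng.normal(size=(d_head, d_head))
out, r_weights = compositional_attention_sketch(x, w_q, w_k, w_v, w_rq, w_rk)
```

Because S and R appear only in the shapes of separate parameter tensors, either can be changed without touching the other, which is one way to read "independent scaling" of search and retrieval.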
Abstract
Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interactions, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval…
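The two computations named in the abstract can be made concrete with a minimal NumPy sketch of standard multi-head attention. It is an illustration under simplifying assumptions (single sequence, no masking, no output projection), and the function and variable names are chosen here for readability. The point to notice is the last einsum: the attention map of head h is always applied to the values of head h, which is the fixed search-retrieval pairing the abstract describes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v):
    """x: (n, d); w_q, w_k, w_v: (h, d, d_head).

    Returns the concatenated head outputs (n, h * d_head)
    and the attention maps (h, n, n).
    """
    q = np.einsum("nd,hde->hne", x, w_q)
    k = np.einsum("nd,hde->hne", x, w_k)
    v = np.einsum("nd,hde->hne", x, w_v)
    d_head = q.shape[-1]

    # (1) Search: query-key interactions select where each position looks.
    attn = softmax(np.einsum("hne,hme->hnm", q, k) / np.sqrt(d_head))

    # (2) Retrieval: the value projection decides what is extracted.
    # The shared index h ties search h to retrieval h.
    out = np.einsum("hnm,hme->hne", attn, v)     # (h, n, d_head)
    return out.transpose(1, 0, 2).reshape(x.shape[0], -1), attn

rng = np.random.default_rng(0)
n, d, h, d_head = 5, 16, 4, 8
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(h, d, d_head)) for _ in range(3))
out, attn = multi_head_attention(x, w_q, w_k, w_v)
```

If two tasks need the same search pattern but different features, or the same features found by different searches, this layout has to spend separate heads on each combination, which is the redundancy the abstract points to.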
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Graph Neural Networks
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Adam · Label Smoothing · Byte Pair Encoding
