Compositional Attention: Disentangling Search and Retrieval

Sarthak Mittal, Sharath Chandra Raparthy, Irina Rish, Yoshua Bengio, and Guillaume Lajoie

arXiv:2110.09419 · cs.LG · February 15, 2022 · 1 citation

TL;DR

This paper introduces Compositional Attention, a novel mechanism that separates search and retrieval in attention heads, enhancing flexibility, reducing redundancy, and improving performance on various tasks, including out-of-distribution scenarios.

Contribution

It proposes a new attention mechanism that disentangles search and retrieval, allowing dynamic composition and better generalization compared to standard multi-head attention.

Findings

01 Outperforms standard attention on multiple tasks

02 Enables dynamic specialization based on retrieval type

03 Generalizes multi-head attention with independent scaling
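
The third finding, independent scaling, can be illustrated with a small NumPy sketch in which the number of searches `S` and the number of retrievals `R` are chosen separately. This is one plausible reading of the idea, not the authors' implementation: the function name, the selection projections `w_rq` and `w_rk`, and the softmax-based choice among retrievals are all assumptions introduced here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compositional_attention_sketch(x, w_q, w_k, w_v, w_rq, w_rk):
    """Hypothetical sketch: S searches that softly choose among R retrievals.

    x:    (n, d_model)          input tokens
    w_q:  (S, d_model, d_head)  one query projection per search
    w_k:  (S, d_model, d_head)  one key projection per search
    w_v:  (R, d_model, d_head)  one value projection per retrieval
    w_rq: (S, d_model, d_sel)   assumed: selection query per search
    w_rk: (d_head, d_sel)       assumed: shared selection key projection
    """
    q = np.einsum("nd,sde->sne", x, w_q)
    k = np.einsum("nd,sde->sne", x, w_k)
    v = np.einsum("nd,rde->rne", x, w_v)
    d_head, d_sel = q.shape[-1], w_rk.shape[-1]

    # Search: each of the S searches produces its own attention pattern.
    attn = softmax(np.einsum("sne,sme->snm", q, k) / np.sqrt(d_head))

    # Candidate retrievals: pair every search with every value projection.
    cand = np.einsum("snm,rme->snre", attn, v)         # (S, n, R, d_head)

    # Assumed selection step: score the R candidates for each search and token.
    sel_q = np.einsum("nd,sdf->snf", x, w_rq)          # (S, n, d_sel)
    sel_k = np.einsum("snre,ef->snrf", cand, w_rk)     # (S, n, R, d_sel)
    sel = softmax(np.einsum("snf,snrf->snr", sel_q, sel_k) / np.sqrt(d_sel))

    # Output: per-search mixture over the R candidate retrievals.
    out = np.einsum("snr,snre->sne", sel, cand)        # (S, n, d_head)
    return sel, out
```

In this sketch the value projections are no longer tied one-to-one to the query-key pairs, so `S` and `R` can differ. With `R = 1` the selection weights are all one and each search simply reads through the single shared value projection.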

Abstract

Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interactions, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval…
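
The two computations named in the abstract can be made concrete with a minimal NumPy sketch of standard multi-head attention. The function and argument names are illustrative choices made here, not taken from the paper; the point is only that head `h` always pairs its own query-key search with its own value projection, which is the rigid mapping the abstract refers to.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v):
    """Standard multi-head self-attention without the output projection.

    x:             (n, d_model)          input tokens
    w_q, w_k, w_v: (h, d_model, d_head)  per-head projections
    """
    q = np.einsum("nd,hde->hne", x, w_q)
    k = np.einsum("nd,hde->hne", x, w_k)
    v = np.einsum("nd,hde->hne", x, w_v)
    d_head = q.shape[-1]

    # (1) Search: query-key interactions decide where each token attends.
    attn = softmax(np.einsum("hne,hme->hnm", q, k) / np.sqrt(d_head))

    # (2) Retrieval: head h reads only through its own value projection.
    per_head = np.einsum("hnm,hme->hne", attn, v)      # (h, n, d_head)

    # Concatenate the heads along the feature dimension.
    out = per_head.transpose(1, 0, 2).reshape(x.shape[0], -1)
    return attn, out
```

Because the same index `h` selects both the search weights and the value weights, a head cannot reuse another head's value projection even when the task would benefit from it.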

Code & Models

Videos

Compositional Attention: Disentangling Search and Retrieval · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Graph Neural Networks

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Adam · Label Smoothing · Byte Pair Encoding