Associative Transformer
Yuwei Sun, Hideya Ochiai, Zhirong Wu, Stephen Lin, Ryota Kanai

TL;DR
The Associative Transformer introduces a memory-augmented sparse attention mechanism that enhances relational reasoning and parameter efficiency in vision tasks, outperforming existing sparse Transformer models.
Contribution
The paper proposes a novel associative memory-based attention method with explicit learnable priors, improving parameter efficiency and performance over prior sparse Transformers.
Findings
AiT requires fewer parameters and layers than comparable models.
AiT outperforms state-of-the-art sparse Transformers on relational reasoning tasks.
AiT also improves performance over prior sparse Transformers on vision classification benchmarks.
Abstract
Emerging from the pairwise attention in conventional Transformers, there is a growing interest in sparse attention mechanisms that align more closely with localized, contextual learning in the biological brain. Existing studies such as the Coordination method employ iterative cross-attention mechanisms with a bottleneck to enable the sparse association of inputs. However, these methods are parameter inefficient and fail in more complex relational reasoning tasks. To this end, we propose Associative Transformer (AiT) to enhance the association among sparsely attended input tokens, improving parameter efficiency and performance in various vision tasks such as classification and relational reasoning. AiT leverages a learnable explicit memory comprising specialized priors that guide bottleneck attentions to facilitate the extraction of diverse localized tokens. Moreover, AiT employs an…
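The prior-guided bottleneck described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function name prior_guided_bottleneck, the scaled dot-product scoring, the sum-over-priors relevance score, and the top-k selection rule are all assumptions made for illustration, and the priors are random here rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prior_guided_bottleneck(tokens, priors, k):
    """Hypothetical sketch: priors cross-attend to tokens, then keep k tokens.

    tokens: (n, d) input token embeddings
    priors: (m, d) memory priors (learned in a real model; random here)
    k:      bottleneck size, i.e. how many tokens survive
    """
    d = tokens.shape[1]
    # Cross-attention scores with priors as queries and tokens as keys.
    scores = priors @ tokens.T / np.sqrt(d)      # shape (m, n)
    attn = softmax(scores, axis=-1)              # each prior's weights sum to 1
    # Assumed relevance rule: total attention a token receives from all priors.
    relevance = attn.sum(axis=0)                 # shape (n,)
    # Bottleneck: keep the k most-attended tokens, in their original order.
    idx = np.sort(np.argsort(relevance)[-k:])
    return tokens[idx], idx, attn

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 tokens of dimension 8
priors = rng.normal(size=(4, 8))    # 4 priors of dimension 8
selected, idx, attn = prior_guided_bottleneck(tokens, priors, k=5)
print(selected.shape, idx)
```

Only the k selected tokens pass through the bottleneck, so the cost of any downstream association step depends on k rather than on the full token count. How AiT actually scores, selects, and associates tokens is specified in the paper itself.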
Taxonomy
Topics: Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Layer Normalization · Label Smoothing · Set Transformer · Byte Pair Encoding · Dropout
