Agent Attention: On the Integration of Softmax and Linear Attention
Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Siyuan Pan, Pengfei Wan, Shiji Song, Gao Huang

TL;DR
This paper introduces Agent Attention, a novel attention mechanism that balances efficiency and expressiveness by integrating softmax and linear attention, demonstrating superior performance across vision tasks and high-resolution scenarios.
Contribution
The paper proposes Agent Attention, a new attention paradigm that combines the strengths of softmax and linear attention, improving efficiency while maintaining global context modeling.
Findings
Agent Attention outperforms traditional softmax attention in efficiency.
It achieves comparable or better accuracy in vision tasks.
It significantly accelerates image generation in high-resolution scenarios.
Abstract
The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, the Agent Attention, denoted as a quadruple (Q, A, K, V), introduces an additional set of agent tokens A into the conventional attention module. The agent tokens first act as the agent for the query tokens Q to aggregate information from K and V, and then broadcast the information back to Q. Given the number of agent tokens can be designed to be much smaller than the number of query tokens, the agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving…
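The two-step flow described in the abstract (agents aggregate from the keys and values, then broadcast back to the queries) can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' implementation: the helper names, the toy sizes, and the choice to build agent tokens by mean-pooling the queries (`mean_pool_agents`) are assumptions made for this example, and it omits any bias terms or extra modules a full model might use.

```python
import math
import random


def softmax_rows(m):
    """Row-wise softmax of a list-of-lists matrix."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def attention(q, k, v):
    """Standard scaled dot-product softmax attention."""
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def agent_attention(q, a, k, v):
    """Two softmax attentions chained through a small set of agent tokens."""
    # Step 1 (aggregation): agents act as queries over K and V -> (n x d).
    agent_values = attention(a, k, v)
    # Step 2 (broadcast): the original queries attend to the agents -> (N x d).
    return attention(q, a, agent_values)


def mean_pool_agents(q, n_agents):
    """Hypothetical agent construction: mean-pool contiguous chunks of Q."""
    chunk = len(q) // n_agents
    agents = []
    for i in range(n_agents):
        block = q[i * chunk:(i + 1) * chunk]
        agents.append([sum(col) / len(block) for col in zip(*block)])
    return agents


# Toy sizes (assumed): N = 8 tokens, d = 4 channels, n = 2 agent tokens.
rng = random.Random(0)
N, d, n = 8, 4, 2
q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
k = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
v = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]

out = agent_attention(q, mean_pool_agents(q, n), k, v)
print(len(out), len(out[0]))  # 8 4
```

With N query tokens, n agent tokens, and d channels, each of the two steps forms only an n x N or N x n score matrix, so the cost scales as O(N n d) rather than the O(N^2 d) of full softmax attention. That is where the efficiency gain comes from when n is much smaller than N.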
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning
Methods: Sparse Evolutionary Training · Diffusion · Softmax
