DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention
Bosheng Qin, Juncheng Li, Siliang Tang, Yueting Zhuang

TL;DR
This paper introduces DBA, a dynamic bilinear low-rank attention mechanism that adaptively compresses sequence length and optimizes hidden state dimensions, achieving efficient linear complexity while maintaining high performance.
Contribution
The paper proposes a novel attention mechanism that dynamically adjusts projection matrices based on input, addressing limitations of fixed projections in prior low-rank Transformers.
Findings
Achieves linear time and space complexity in sequence length.
Maintains state-of-the-art performance on diverse tasks.
Reduces memory consumption and increases speed.
Abstract
Many studies have been conducted to improve the efficiency of Transformer from quadratic to linear. Among them, the low-rank-based methods aim to learn the projection matrices to compress the sequence length. However, the projection matrices are fixed once they have been learned, which compresses sequence length with dedicated coefficients for tokens in the same position. Adopting such input-invariant projections ignores the fact that the most informative part of a sequence varies from sequence to sequence, thus failing to preserve the most useful information that lies in varied positions. In addition, previous efficient Transformers only focus on the influence of sequence length while neglecting the effect of hidden state dimension. To address the aforementioned problems, we present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA), which…
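The contrast the abstract draws between input-invariant and input-dependent projections can be illustrated with a minimal NumPy sketch. This is not the authors' DBA formulation, which the truncated abstract does not spell out. It assumes a single head, a Linformer-style fixed projection `E`, and a hypothetical weight `W` that produces the dynamic projection from the keys. All function and parameter names are illustrative, and the hidden-state-dimension compression mentioned in the abstract is not covered.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compressed_attention(Q, K_c, V_c):
    # Attention over k compressed positions: scores are (n, k), not (n, n).
    scores = Q @ K_c.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V_c

def fixed_lowrank_attention(Q, K, V, E):
    # E has shape (k, n) and is fixed after training (input-invariant),
    # so each position always receives the same mixing coefficients.
    return compressed_attention(Q, E @ K, E @ V)

def dynamic_projection(K, W):
    # Assumption for illustration: the projection is computed from the
    # input itself. W has shape (d, k); the result has shape (k, n) and
    # each row is a distribution over sequence positions.
    return softmax((K @ W).T, axis=-1)

def dynamic_lowrank_attention(Q, K, V, W):
    # The projection P depends on K, so different sequences are
    # compressed with different coefficients.
    P = dynamic_projection(K, W)
    return compressed_attention(Q, P @ K, P @ V)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k = 16, 8, 4
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    W = rng.standard_normal((d, k))
    print(dynamic_lowrank_attention(Q, K, V, W).shape)
```

In both variants every matrix product costs O(n·k·d), so time and memory grow linearly with the sequence length n when k is held constant. In this sketch the dynamic variant also accepts sequences of any length, because `W` does not depend on n, whereas the fixed `E` is tied to one length.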
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Advanced Neural Network Applications · Face and Expression Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Adam · Layer Normalization · Absolute Position Encodings · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing
