PolaFormer: Polarity-aware Linear Attention for Vision Transformers

Weikang Meng; Yadan Luo; Xin Li; Dongmei Jiang; Zheng Zhang

arXiv:2501.15061·cs.CV·March 5, 2025·3 cites

Open Access 3 Reviews

TL;DR

PolaFormer introduces a polarity-aware linear attention mechanism for vision transformers that explicitly models both positive and negative query-key interactions, reducing information loss and improving performance and efficiency.

Contribution

It proposes a novel polarity-aware linear attention method with a learnable rescaling function, addressing information loss in existing linear attention models.

Findings

01

Improves vision transformer performance by up to 4.6%.

02

Enhances attention map expressiveness and reduces entropy.

03

Provides theoretical analysis for entropy reduction in attention maps.
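The entropy claim can be illustrated with a toy example. This is a minimal sketch, not the paper's construction: it assumes a fixed power with exponent above 1 as the element-wise function, applies it directly to a hand-picked row of positive scores rather than to query and key features, and uses the hypothetical helper names `entropy` and `normalize_after_power`.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a strictly positive probability vector."""
    return float(-(p * np.log(p)).sum())

def normalize_after_power(scores, power):
    """Raise positive scores to `power` element-wise, then normalize to sum to 1."""
    powered = scores ** power
    return powered / powered.sum()

# Illustrative, hand-picked attention scores for one query (assumption).
scores = np.array([0.1, 0.2, 0.3, 0.4])

flat = normalize_after_power(scores, 1.0)   # plain normalization
spiky = normalize_after_power(scores, 3.0)  # power > 1 concentrates the mass

entropy_flat = entropy(flat)
entropy_spiky = entropy(spiky)
print(entropy_flat, entropy_spiky)
```

The rescaled row puts more weight on the largest score, so its entropy is lower than that of the plainly normalized row.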

Abstract

Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less discriminative attention maps with higher entropy. To address the missing interactions driven by negative values in query-key pairs, we propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions, ensuring comprehensive coverage of relational information. Furthermore, to restore the spiky properties of attention maps, we provide a theoretical analysis proving the existence of a class of element-wise functions (with…
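The complexity reduction described in the abstract comes from reordering the matrix products once the softmax is replaced by a kernel feature map. The sketch below is a generic illustration, not the authors' implementation: it assumes ReLU as the non-negative feature map, a single head, and the hypothetical function names `linear_attention` and `quadratic_reference`.

```python
import numpy as np

def relu_feature_map(x):
    """Non-negative feature map (ReLU is an assumption for illustration)."""
    return np.maximum(x, 0.0)

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention computed without forming the n x n score matrix."""
    phi_q = relu_feature_map(q)   # (n, d)
    phi_k = relu_feature_map(k)   # (n, d)
    kv = phi_k.T @ v              # (d, d_v), cost linear in n
    z = phi_k.sum(axis=0)         # (d,), normalizer statistics
    num = phi_q @ kv              # (n, d_v)
    den = phi_q @ z + eps         # (n,)
    return num / den[:, None]

def quadratic_reference(q, k, v, eps=1e-6):
    """Same result via the explicit n x n score matrix, for comparison."""
    scores = relu_feature_map(q) @ relu_feature_map(k).T   # (n, n)
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))

out_linear = linear_attention(q, k, v)
out_quadratic = quadratic_reference(q, k, v)
print(np.allclose(out_linear, out_quadratic))
```

Both functions produce the same output; the linear version only ever builds a `d x d_v` summary of the keys and values. Note that the ReLU map discards every negative entry of `q` and `k`, which is the information loss the abstract refers to.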

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

S1. The method presented in this paper is simple and well justified.
S2. The paper is generally easy to read.
S3. Results outperform the full attention baseline (I merely expected it to be a good approximation).

Weaknesses

W1. The main drawback of the experiments is that there is no experiment on truly high-resolution images, where this method would fully benefit from the linear complexity.
W2. Some parts are not clear (see below).
Unclear points:
- It would be useful to specify from the intro that the method is applied with full training; it is unclear until the experiments that this is neither fine-tuning nor a drop-in replacement for softmax attention at inference time.
- L133: better = more accurate or faste

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The paper is well written and organized.
2. The paper provides solid motivation for the method and the proposed approach is well justified.
3. The experimental results show promising performance.

Weaknesses

1. After reading the paper, I am still not sure how the non-negativity constraint is preserved, especially when a learnable matrix $G$ is applied. Could the authors provide more explanation on how the learnable matrix $G$ can preserve the non-negativity constraint?
2. Some of the latest baselines are missing from the paper. For instance, the authors should consider incorporating the latest work [1] for comparison. This baseline also proposes a new linear self-attention to achieve both high expressiveness capaci

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The key strength is the introduction of polarity-aware attention. By decomposing the query and key vectors into positive and negative components (as in Equation (3)), they capture all possible interactions: positive-positive, negative-negative, positive-negative, and negative-positive. This is a significant departure from traditional methods that only consider positive interactions due to non-negative feature maps.
2. They provide solid theoretical analysis. For example, in Theorem 1, they pr
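The four interactions the reviewer lists follow from a simple algebraic identity. The sketch below is a hedged illustration of that identity only: it does not reproduce the paper's Equation (3), the learnable rescaling, or the polarity coefficients, and the helper name `split_polarity` is hypothetical.

```python
import numpy as np

def split_polarity(x):
    """Split x into non-negative parts so that x == pos - neg."""
    pos = np.maximum(x, 0.0)
    neg = np.maximum(-x, 0.0)
    return pos, neg

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 4))   # toy queries (assumption)
k = rng.standard_normal((5, 4))   # toy keys (assumption)

q_pos, q_neg = split_polarity(q)
k_pos, k_neg = split_polarity(k)

# Same-signed terms: positive-positive and negative-negative.
same_signed = q_pos @ k_pos.T + q_neg @ k_neg.T
# Opposite-signed terms: positive-negative and negative-positive.
opposite_signed = q_pos @ k_neg.T + q_neg @ k_pos.T

full_scores = q @ k.T
# A ReLU-only feature map keeps just the positive-positive term.
relu_only = q_pos @ k_pos.T

print(np.allclose(same_signed - opposite_signed, full_scores))
```

Every term is non-negative, so each can be used in a kernelized attention computation, and together they recover the full dot product that a ReLU-only map truncates.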

Weaknesses

1. Introducing a learnable power function and additional parameters like the polarity coefficients $G_s$ and $G_o$ could introduce training challenges, such as sensitivity to initialization or convergence issues. The paper doesn't discuss whether they encountered any of these problems or how they addressed them.
2. Since attention mechanisms are also fundamental in NLP, it would have been interesting to see PolaFormer's performance on language tasks. The paper focuses solely on vision ta

Taxonomy

Topics: CCD and CMOS Imaging Sensors

Methods: Softmax · Attention Is All You Need