Lightweight Structure-Aware Attention for Visual Understanding

Heeseung Kwon; Francisco M. Castro; Manuel J. Marin-Jimenez; Nicolas Guil; Karteek Alahari

arXiv:2211.16289 · cs.CV · July 4, 2025

TL;DR

This paper introduces LiSA, a novel lightweight attention operator with log-linear complexity that enhances discriminative power by encoding structural patterns, leading to state-of-the-art results across various visual understanding tasks.

Contribution

The paper proposes a new attention operator, LiSA, that improves the representation power of attention kernels by learning structural patterns through relative position embeddings, while keeping complexity log-linear.

Findings

01

LiSA outperforms existing attention methods on ImageNet-1K.

02

LiSA achieves state-of-the-art results on Kinetics-400, COCO, and ADE-20K.

03

LiSA has log-linear computational complexity, making it efficient for large-scale tasks.

Abstract

The attention operator has been widely used as a basic building block in visual understanding because its adjustable kernels provide flexibility. However, this operator suffers from two inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy, and (2) its computation and memory complexity is quadratic in the sequence length. In this paper, we propose a novel attention operator, called Lightweight Structure-aware Attention (LiSA), which has better representation power with log-linear complexity. Our operator transforms the attention kernels to be more discriminative by learning structural patterns. These structural patterns are encoded by exploiting a set of relative position embeddings (RPEs) as multiplicative weights, thereby improving the representation power of the attention kernels. Additionally, the RPEs are approximated to obtain…
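
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It builds a standard softmax attention kernel, which holds one weight per (query, key) pair and therefore grows quadratically with the sequence length, and then rescales each entry by a weight that depends only on the relative offset between the two positions. The function names, the `rel_weights` lookup table, the row renormalisation, and the toy inputs are all assumptions made for illustration; the approximation that gives LiSA its log-linear complexity is not shown.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_kernel(queries, keys):
    """Standard scaled dot-product attention kernel.

    Returns an N x N list of lists: one weight per (query, key) pair,
    which is why time and memory are quadratic in the sequence length N.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    kernel = []
    for q in queries:
        scores = [scale * sum(qc * kc for qc, kc in zip(q, k)) for k in keys]
        kernel.append(softmax(scores))
    return kernel


def apply_relative_weights(kernel, rel_weights):
    """Multiply kernel[i][j] by a weight indexed by the offset j - i.

    rel_weights is a hypothetical table of 2N - 1 positive numbers, one per
    relative offset from -(N - 1) to N - 1. Renormalising each row afterwards
    is an assumption of this sketch, not something stated in the abstract.
    """
    n = len(kernel)
    out = []
    for i, row in enumerate(kernel):
        weighted = [row[j] * rel_weights[j - i + n - 1] for j in range(n)]
        total = sum(weighted)
        out.append([w / total for w in weighted])
    return out


# Toy input: N = 4 tokens with 2 channels each (made-up values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
keys = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [-0.5, 0.5]]

kernel = attention_kernel(queries, keys)

# Offsets -3..3; the largest weight sits at offset 0, so nearby positions
# are emphasised. These numbers are fixed here, whereas in a trained model
# they would be learned.
rel_weights = [0.1, 0.5, 1.0, 2.0, 1.0, 0.5, 0.1]
structured = apply_relative_weights(kernel, rel_weights)

for row in structured:
    print([round(w, 3) for w in row])
```

Both functions still materialise the full N x N kernel, so this sketch remains quadratic; it only illustrates how position-dependent multiplicative weights can reshape an attention kernel.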

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications