Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain

Hyowon Wi; Jeongwhan Choi; Noseong Park

arXiv:2505.08516·cs.LG·May 14, 2025

TL;DR

This paper introduces AGF, a self-attention mechanism that treats attention as a graph filter in the singular value domain. The authors report that it uses frequency information more effectively and achieves state-of-the-art results on long-range and time series tasks.

Contribution

The paper proposes AGF, a new self-attention method that models graph filters in the singular value domain so that linear Transformers can make better use of frequency information.

Findings

01

AGF achieves state-of-the-art performance on the Long Range Arena benchmark.

02

AGF demonstrates superior results in time series classification.

03

The method maintains linear complexity with respect to input length.
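The linear-complexity claim can be illustrated with a minimal sketch. It assumes a filter written as U g(S) V^T with rank r much smaller than the sequence length n; the helper names (`filter_factored`, `filter_dense`), the random factors, and the polynomial `g` are all hypothetical and are not taken from the paper. The point is only that multiplying right to left never builds an n x n matrix, so the cost grows linearly in n.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def filter_factored(U, s, V, X, g):
    """Compute U g(S) V^T X right to left: O(n * r * d), no n x n matrix."""
    VtX = matmul(transpose(V), X)                      # r x d
    scaled = [[g(si) * v for v in row] for si, row in zip(s, VtX)]
    return matmul(U, scaled)                           # n x d

def filter_dense(U, s, V, X, g):
    """Same product, but materializes the n x n filter first: O(n^2)."""
    Us = [[u * g(si) for u, si in zip(row, s)] for row in U]
    A = matmul(Us, transpose(V))                       # n x n
    return matmul(A, X)

random.seed(0)
n, r, d = 8, 3, 2   # sequence length, assumed rank, feature width
# Random factors for illustration; a true SVD would have orthonormal U and V.
U = [[random.gauss(0, 1) for _ in range(r)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(r)] for _ in range(n)]
s = [random.random() for _ in range(r)]
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

def g(sigma):
    # Hypothetical polynomial response applied to each singular value.
    return 1.0 - sigma + 0.5 * sigma * sigma

fast = filter_factored(U, s, V, X, g)
slow = filter_dense(U, s, V, X, g)
max_diff = max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb))
```

Both routines return the same n x d output up to floating-point error; only the factored one avoids the quadratic intermediate.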

Abstract

Transformers have demonstrated remarkable performance across diverse domains. The key component of Transformers is self-attention, which learns the relationship between any two tokens in the input sequence. Recent studies have revealed that the self-attention can be understood as a normalized adjacency matrix of a graph. Notably, from the perspective of graph signal processing (GSP), the self-attention can be equivalently defined as a simple graph filter, applying GSP using the value vector as the signal. However, the self-attention is a graph filter defined with only the first order of the polynomial matrix, and acts as a low-pass filter preventing the effective leverage of various frequency information. Consequently, existing self-attention mechanisms are designed in a rather simplified manner. Therefore, we propose a novel method, called Attentive…
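The graph-filter reading in the abstract can be made concrete with a small sketch. It assumes standard scaled dot-product softmax attention and a generic polynomial filter of the form sum_k w_k A^k; the function names and the weights are illustrative assumptions, not the paper's parameterization. It shows that the softmax matrix is row-stochastic (a normalized adjacency matrix), that plain attention is the first-order case, and that this case smooths the signal, while a different choice of weights can behave like a high-pass filter.

```python
import math
import random

def softmax_rows(scores):
    """Row-wise softmax, giving a row-stochastic attention matrix."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def polynomial_filter(A, x, weights):
    """Apply sum_k weights[k] * A^k to the graph signal x."""
    result = [0.0] * len(x)
    power = list(x)                      # A^0 x
    for w in weights:
        result = [r + w * p for r, p in zip(result, power)]
        power = matvec(A, power)         # advance to the next power of A
    return result

random.seed(0)
n, d = 6, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
           for j in range(n)] for i in range(n)]
A = softmax_rows(scores)                 # normalized adjacency matrix

x = [random.gauss(0, 1) for _ in range(n)]   # one column of the value matrix

attn_out = matvec(A, x)                              # ordinary attention output
first_order = polynomial_filter(A, x, [0.0, 1.0])    # same thing as a filter
# Illustrative high-pass choice (I - A): it removes the constant component.
high_pass_on_constant = polynomial_filter(A, [1.0] * n, [1.0, -1.0])

spread_before = max(x) - min(x)
spread_after = max(attn_out) - min(attn_out)
```

Because every row of `A` is a convex combination, `spread_after` cannot exceed `spread_before`, which is the smoothing (low-pass) behavior the abstract refers to. The `(I - A)` weights are just one example of how higher-order or signed coefficients reach frequencies that the first-order filter suppresses.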
