The Origin of Self-Attention: Pairwise Affinity Matrices in Feature Selection and the Emergence of Self-Attention

Giorgio Roffo

arXiv:2507.14560·cs.LG·July 29, 2025

TL;DR

This paper traces the origins of self-attention in deep learning and frames it as a special case of affinity-matrix-based feature selection. It unifies several model families through their shared reliance on pairwise relationships.

Contribution

The paper argues that self-attention is a specific instance of a broader affinity-based feature selection framework, connecting it to earlier methods such as Infinite Feature Selection (Inf-FS).

Findings

01. Self-attention is a special case of affinity-based feature selection.

02. Affinity matrices can be defined through domain knowledge or learned.

03. The approach unifies diverse machine learning models under a common mathematical foundation.

Abstract

The self-attention mechanism, now central to deep learning architectures such as Transformers, is a modern instance of a more general computational principle: learning and using pairwise affinity matrices to control how information flows through a model. This paper traces the conceptual origins of self-attention across multiple domains, including computer vision, natural language processing, and graph learning, through their shared reliance on an affinity matrix, denoted as A. We highlight Infinite Feature Selection (Inf-FS) as a foundational approach that generalizes the idea of affinity-based weighting. Unlike the fixed dot-product structure used in Transformers, Inf-FS defines A either through domain knowledge or by learning, and computes feature relevance through multi-hop propagation over the affinity graph. From this perspective, self-attention can be seen as a special case of…
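
To make the two kinds of affinity matrix in the abstract concrete, here is a minimal NumPy sketch; it is an illustration under stated assumptions, not the paper's implementation. The function names (`attention_affinity`, `correlation_affinity`, `multi_hop_scores`), the random matrices standing in for trained projections, the use of absolute Pearson correlation as the hand-crafted affinity, and the damping parameter `alpha` are all hypothetical choices made for this example. Multi-hop propagation is realised here as a convergent geometric series of matrix powers, which is one common way to sum over paths of every length; whether it matches the paper's exact formulation is an assumption.

```python
import numpy as np


def attention_affinity(X, W_q, W_k):
    """Affinity from scaled dot-product attention: A = softmax(Q K^T / sqrt(d_k))."""
    Q, K = X @ W_q, X @ W_k
    logits = Q @ K.T / np.sqrt(K.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def correlation_affinity(data):
    """Hand-crafted affinity between features (columns): absolute Pearson correlation.

    Assumption: a simplified stand-in for a domain-knowledge affinity.
    """
    A = np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A


def multi_hop_scores(A, alpha=0.5):
    """Score each node by summing path weights over all path lengths.

    S = sum_{l>=1} (r A)^l = (I - r A)^{-1} - I, with r = alpha / spectral_radius(A).
    """
    n = A.shape[0]
    rho = np.max(np.abs(np.linalg.eigvals(A)))  # spectral radius of A
    r = alpha / rho  # alpha < 1 keeps the geometric series convergent
    S = np.linalg.inv(np.eye(n) - r * A) - np.eye(n)
    return S.sum(axis=1)


rng = np.random.default_rng(0)

# Learned-style affinity: 5 tokens, random projections standing in for trained weights.
X = rng.normal(size=(5, 8))
W_q, W_k = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
A_attn = attention_affinity(X, W_q, W_k)
one_hop = A_attn @ X  # single-hop propagation; values are taken to be X for simplicity

# Hand-crafted affinity: 6 features observed over 100 samples.
data = rng.normal(size=(100, 6))
A_corr = correlation_affinity(data)
feature_scores = multi_hop_scores(A_corr)
ranking = np.argsort(-feature_scores)  # highest-scoring feature first
```

In this sketch the attention path applies the affinity once (`A_attn @ X`), while the feature-selection path aggregates over walks of every length before ranking. That contrast is the sense in which single-hop attention can be read as a special case of affinity-based weighting.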
