The Origin of Self-Attention: Pairwise Affinity Matrices in Feature Selection and the Emergence of Self-Attention
Giorgio Roffo

TL;DR
This paper explores the origins of self-attention in deep learning, framing it as a special case of affinity matrix-based feature selection, unifying various models through their shared reliance on pairwise relationships.
Contribution
The paper shows that self-attention is a specific instance of the broader affinity-based feature selection framework, connecting it to earlier methods such as Infinite Feature Selection.
Findings
Self-attention is a special case of affinity-based feature selection.
Affinity matrices can be defined through domain knowledge or learned.
The approach unifies diverse machine learning models under a common mathematical foundation.
Abstract
The self-attention mechanism, now central to deep learning architectures such as Transformers, is a modern instance of a more general computational principle: learning and using pairwise affinity matrices to control how information flows through a model. This paper traces the conceptual origins of self-attention across multiple domains, including computer vision, natural language processing, and graph learning, through their shared reliance on an affinity matrix, denoted as A. We highlight Infinite Feature Selection (Inf-FS) as a foundational approach that generalizes the idea of affinity-based weighting. Unlike the fixed dot-product structure used in Transformers, Inf-FS defines A either through domain knowledge or by learning, and computes feature relevance through multi-hop propagation over the affinity graph. From this perspective, self-attention can be seen as a special case of…
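The contrast drawn in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's code: the function names (`attention_affinity`, `multi_hop_scores`), the damping parameter `alpha`, and the choice of absolute correlation as a hand-crafted affinity are assumptions made for illustration. The first function builds A in the Transformer style, as a row-normalized scaled dot product of learned projections. The second scores nodes of an affinity graph by summing contributions from paths of every length, in the spirit of Inf-FS, using the closed form of a convergent geometric series of matrix powers.

```python
import numpy as np


def softmax_rows(x):
    """Numerically stable softmax applied to each row."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def attention_affinity(X, W_q, W_k):
    """Transformer-style affinity: A = softmax(Q K^T / sqrt(d_k)).

    X is (n_items, d_model); W_q and W_k are (d_model, d_k) projections.
    Returns an (n_items, n_items) row-stochastic matrix.
    """
    Q = X @ W_q
    K = X @ W_k
    d_k = K.shape[1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k))


def multi_hop_scores(A, alpha=0.5):
    """Multi-hop relevance over an affinity graph (Inf-FS-style sketch).

    Sums (r A)^l over all path lengths l >= 1 via the closed form
    (I - r A)^{-1} - I, where r = alpha / rho(A) and 0 < alpha < 1
    keeps the series convergent. `alpha` is an illustrative parameter.
    Returns one score per node (row sums of the accumulated matrix).
    """
    n = A.shape[0]
    rho = np.max(np.abs(np.linalg.eigvals(A)))  # spectral radius
    r = alpha / rho
    S = np.linalg.inv(np.eye(n) - r * A) - np.eye(n)
    return S.sum(axis=1)


rng = np.random.default_rng(0)

# Learned-style affinity between 5 items (e.g. tokens) with 4-dim embeddings.
X = rng.normal(size=(5, 4))
A_attention = attention_affinity(X, rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
V = X  # values; a single attention step is one hop: A @ V
one_hop_output = A_attention @ V

# Hand-crafted affinity between 6 features: absolute Pearson correlation.
data = rng.normal(size=(50, 6))
A_features = np.abs(np.corrcoef(data, rowvar=False))
feature_scores = multi_hop_scores(A_features, alpha=0.5)
ranking = np.argsort(-feature_scores)  # most relevant feature first
```

In this sketch a single attention step uses A once (`A_attention @ V`), while the multi-hop score aggregates all powers of a damped A. Note that for a row-stochastic A such as the softmax output, the row sums of the accumulated matrix are identical for every node, so the ranking example uses a symmetric, hand-crafted affinity instead.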
