Attention with Trained Embeddings Provably Selects Important Tokens

Diyuan Wu; Aleksandr Shevchenko; Samet Oymak; Marco Mondelli

arXiv:2505.17282·cs.LG·June 26, 2025

TL;DR

This paper proves that, in a one-layer softmax attention model, token embeddings trained by gradient descent identify the important tokens: the embeddings reflect each token's frequency and predictive power. Experiments confirm the theory.

Contribution

The paper characterizes the structure of embeddings learned by gradient descent and proves, for a one-layer softmax attention model with a linear head, that these embeddings encode token importance.

Findings

01

Embeddings align with token importance after a single gradient step.

02

Softmax attention selects predictive tokens after training convergence.

03

Experimental results on IMDB and Yelp datasets support the theory.

Abstract

Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\mathrm{Softmax}(p^{\top} E_{X}^{\top}) E_{X} v = \frac{\sum_{i=1}^{T} \exp(p^{\top} E_{x_{i}}) E_{x_{i}}^{\top} v}{\sum_{j=1}^{T} \exp(p^{\top} E_{x_{j}})}$, where $E_{X} = [E_{x_{1}}, \dots, E_{x_{T}}]^{\top}$ contains the embeddings of the input sequence, $p$ is the embedding of the $\langle \mathrm{cls} \rangle$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings $E_{X}$ capture the importance of tokens in the dataset by aligning with the output vector $v$ …
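The model in the abstract is small enough to write out directly. The following is a minimal sketch, not the authors' code: it evaluates $\mathrm{Softmax}(p^{\top} E_{X}^{\top}) E_{X} v$ for one sequence and takes one full-batch gradient step on the embeddings with the logistic loss. The function names, the toy dataset, the zero initialization of the embeddings, the choice $p = 0$, the fixed $v$, and the step size are all assumptions made for illustration; the excerpt above does not specify the paper's training setup.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_output(tokens, E, p, v):
    """Softmax(p^T E_X^T) E_X v for one token sequence (scalar output)."""
    logits = [dot(p, E[t]) for t in tokens]       # p^T E_{x_i}
    m = max(logits)                               # subtract max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return sum(wi * dot(E[t], v) for wi, t in zip(w, tokens)) / z

def one_gradient_step(data, E, v, lr):
    """One full-batch gradient step on E for the logistic loss.

    Assumption (hypothetical setup): p = 0, so attention is uniform and
    the gradient of the output w.r.t. E_s is (count_s / T) * v.
    """
    d = len(v)
    p = [0.0] * d
    grad = {s: [0.0] * d for s in E}
    n = len(data)
    for tokens, y in data:                        # labels y are +1 or -1
        f = attention_output(tokens, E, p, v)
        dloss_df = -y / (1.0 + math.exp(y * f))   # derivative of log(1 + exp(-y f))
        for t in tokens:                          # each position contributes v / T
            for k in range(d):
                grad[t][k] += dloss_df * v[k] / (len(tokens) * n)
    return {s: [E[s][k] - lr * grad[s][k] for k in range(d)] for s in E}

# Toy dataset (invented for illustration): "good" appears only in positive
# sequences, "bad" only in negative ones, "the" equally often in both.
data = [
    (["good", "the", "good"], +1),
    (["bad", "the", "bad"], -1),
    (["the", "good", "the"], +1),
    (["the", "bad", "the"], -1),
]
v = [1.0, 0.0]
E0 = {s: [0.0, 0.0] for s in ("good", "bad", "the")}
E1 = one_gradient_step(data, E0, v, lr=1.0)
for s in E1:
    print(s, E1[s], "alignment with v:", dot(E1[s], v))
```

Under these assumptions, the embedding of "good" moves in the direction of $v$, the embedding of "bad" moves in the opposite direction, and the embedding of "the" stays at zero because its occurrences in positive and negative sequences cancel. This is a toy instance of embeddings aligning with the output vector according to how often a token appears with each label.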

Taxonomy

Methods: Attention Is All You Need · Softmax