Attention with Trained Embeddings Provably Selects Important Tokens
Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli

TL;DR
This paper proves that, in a simple attention model, gradient-trained token embeddings identify important tokens: the embeddings reflect each token's frequency and predictive power. Experiments support the theory.
Contribution
It characterizes how gradient-trained embeddings encode token importance and proves this behavior in a simplified attention model.
Findings
Embeddings align with token importance after a single gradient step.
Softmax attention selects predictive tokens after training convergence.
Experimental results on IMDB and Yelp datasets support the theory.
Abstract
Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\mathrm{Softmax}(p^\top E_X^\top) E_X v = \frac{\sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_j})}$, where $E_X = [E_{x_1}, \ldots, E_{x_T}]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\langle\mathrm{cls}\rangle$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings capture the importance of tokens in the dataset by aligning with the output vector …
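To make the model concrete, here is a minimal sketch in plain Python of the one-layer softmax attention score described above: softmax weights computed from the inner products between the classification-token embedding p and each token embedding, applied to the inner products between each token embedding and the output vector v. It also takes one gradient step on the embeddings under the logistic loss. The function names, the toy dataset, the zero initialization of the embeddings and of p, and the learning rate are all assumptions made for illustration; they are not taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_score(token_ids, E, p, v):
    """Softmax-weighted average of E[t].v, with weights softmax(p.E[t])."""
    logits = [dot(p, E[t]) for t in token_ids]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    values = [dot(E[t], v) for t in token_ids]
    return sum(w * val for w, val in zip(weights, values)) / sum(weights)

def one_gradient_step(data, vocab_size, v, lr=1.0):
    """One gradient-descent step on E for the mean logistic loss.

    Assumes E = 0 and p = 0 at initialization (illustrative choice).
    With p = 0 the attention is uniform and the gradient path through
    the softmax logits vanishes, so df/dE[t] = count_t / T * v exactly.
    """
    d = len(v)
    E = [[0.0] * d for _ in range(vocab_size)]
    p = [0.0] * d
    grad = [[0.0] * d for _ in range(vocab_size)]
    n = len(data)
    for tokens, y in data:
        f = attention_score(tokens, E, p, v)
        # derivative of log(1 + exp(-y f)) with respect to f
        dloss_df = -y / (1.0 + math.exp(y * f))
        for t in tokens:
            for j in range(d):
                grad[t][j] += dloss_df * v[j] / (len(tokens) * n)
    return [[-lr * g for g in row] for row in grad]

# Forward pass on a toy vocabulary of three tokens.
E_toy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v = [1.0, 0.0]
uniform = attention_score([0, 1, 2], E_toy, [0.0, 0.0], v)    # p = 0: plain average
selective = attention_score([0, 1, 2], E_toy, [5.0, 0.0], v)  # p favors token 0

# Toy data: token 0 appears only with label +1, token 1 only with -1,
# token 2 appears equally often with both labels.
data = [([0, 2], 1), ([0, 0, 2], 1), ([1, 2], -1), ([1, 1, 2], -1)]
E_new = one_gradient_step(data, vocab_size=3, v=v)
alignment = [dot(e, v) for e in E_new]
print(uniform, selective, alignment)
```

In this toy setting, the embedding of the token seen only with positive labels moves along v, the one seen only with negative labels moves against v, and the label-agnostic token stays at zero. This illustrates the qualitative behavior described in the abstract and is not a reproduction of the paper's results.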
Taxonomy
Methods: Attention Is All You Need · Softmax
