Looking Beyond The Top-1: Transformers Determine Top Tokens In Order
Daria Lioubashevski, Tomer Schlank, Gabriel Stanovsky, Ariel Goldstein

TL;DR
This paper reveals that Transformers predict tokens in order of their ranking, with saturation events occurring sequentially, and introduces a new early-exit method based on this insight to improve efficiency.
Contribution
It uncovers the ordered nature of saturation events in Transformers across modalities and proposes a task transition mechanism and an effective early-exit strategy.
Findings
Saturation events occur in token ranking order.
Transformers exhibit this behavior across architectures and modalities.
The proposed early-exit method improves efficiency.
Abstract
Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the "saturation event". We expand the concept of saturation events for top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens' ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an…
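To make the notion of a top-k saturation event concrete, here is a minimal sketch under one plausible reading of the abstract: the saturation layer of the rank-j token is the earliest layer from which the token at rank j stays the same through the final layer. The function name `saturation_layers` and the input layout (one row of vocabulary logits per layer, for a single position) are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def saturation_layers(layer_logits, k):
    """Return, for ranks 1..k, the earliest layer index from which the token
    at that rank equals the final layer's token at that rank and never changes.

    layer_logits: array-like of shape (num_layers, vocab_size), assumed to hold
    per-layer logits for one position (hypothetical input layout).
    """
    layer_logits = np.asarray(layer_logits)
    # ranks[l, j] is the token id holding rank j+1 at layer l.
    ranks = np.argsort(-layer_logits, axis=1, kind="stable")[:, :k]
    final = ranks[-1]
    num_layers = ranks.shape[0]
    result = []
    for j in range(k):
        matches = ranks[:, j] == final[j]
        # Walk backwards from the last layer while the rank-j token is unchanged.
        layer = num_layers - 1
        while layer > 0 and matches[layer - 1]:
            layer -= 1
        result.append(layer)
    return result

# Toy example: 4 layers, vocabulary of 3 tokens.
toy_logits = [
    [1, 3, 2],
    [3, 1, 2],
    [3, 2, 1],
    [3, 2, 1],
]
print(saturation_layers(toy_logits, 2))
```

In this toy input the top-1 token becomes fixed one layer before the second-ranked token does, which is the ordering the abstract describes.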
Taxonomy
Topics: Simulation Techniques and Applications · Computability, Logic, AI Algorithms · Manufacturing Process and Optimization
Methods: Linear Layer · Dense Connections · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Attention Is All You Need · Multi-Head Attention · Softmax · Adam
