Looking Beyond The Top-1: Transformers Determine Top Tokens In Order

Daria Lioubashevski; Tomer Schlank; Gabriel Stanovsky; Ariel Goldstein

arXiv:2410.20210 · cs.CL · October 29, 2024

TL;DR

This paper shows that Transformers settle on their top-ranked tokens in rank order: the top-1 prediction saturates first, then the second-ranked token, and so on. It uses this insight to introduce a new early-exit method aimed at improving efficiency.

Contribution

It shows that saturation events occur in rank order in language, vision, and speech Transformers, and it proposes a task-transition mechanism and an early-exit strategy based on that ordering.

Findings

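01 Saturation events occur in order of token rank: the top-1 token becomes fixed first, then the second-ranked token, and so on.

02 The behavior appears across modalities (language, vision, speech) and architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers.

03 The proposed early-exit method, built on this ordering, improves efficiency.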

Abstract

Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the "saturation event". We expand the concept of saturation events for top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens' ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an…
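The ordered-saturation claim can be made concrete with a small sketch. It assumes per-layer logits over the vocabulary are already available for one position (for example, by projecting each layer's hidden state through the output head), and it takes the rank-i saturation layer to be the earliest layer from which the rank-i token stays equal to the final layer's rank-i token. The function name `saturation_layers`, the toy logits, and this exact reading of the definition are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def saturation_layers(layer_logits, k=3):
    """Return, for ranks 1..k, the earliest layer index (0-based) from which
    the token at that rank matches the final layer's token at that rank
    in every later layer. Hypothetical helper, not the authors' code.

    layer_logits: array of shape (num_layers, vocab_size).
    """
    layer_logits = np.asarray(layer_logits)
    num_layers = layer_logits.shape[0]
    # order[l, i] is the token id holding rank i+1 at layer l.
    order = np.argsort(-layer_logits, axis=1, kind="stable")[:, :k]
    final = order[-1]
    result = []
    for i in range(k):
        matches = order[:, i] == final[i]
        # Walk backwards from the last layer while the rank-i token is unchanged.
        layer = num_layers - 1
        while layer > 0 and matches[layer - 1]:
            layer -= 1
        result.append(layer)
    return result

# Toy logits for 5 layers over a 4-token vocabulary (made-up numbers).
toy = np.array([
    [2.0, 4.0, 3.0, 1.0],  # layer 0 ranking: 1, 2, 0, 3
    [2.0, 3.0, 4.0, 1.0],  # layer 1 ranking: 2, 1, 0, 3
    [3.0, 2.0, 4.0, 1.0],  # layer 2 ranking: 2, 0, 1, 3
    [3.0, 1.0, 4.0, 2.0],  # layer 3 ranking: 2, 0, 3, 1
    [3.5, 0.5, 5.0, 2.0],  # layer 4 ranking: 2, 0, 3, 1
])

print(saturation_layers(toy, k=3))  # [1, 2, 3]
```

In this toy input the top-1 token is fixed from layer 1, the second-ranked token from layer 2, and the third from layer 3, which is the ordered pattern the abstract describes. Real models would need this statistic aggregated over many positions and inputs.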

Code & Models

Repositories

daria-lioubashevski/beyond_top1 (Official)

Taxonomy

Topics: Simulation Techniques and Applications · Computability, Logic, AI Algorithms · Manufacturing Process and Optimization

Methods: Linear Layer · Dense Connections · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Attention Is All You Need · Multi-Head Attention · Softmax · Adam