Cascade Token Selection for Transformer Attention Acceleration

Stephen J. Thomas

arXiv:2605.03110 · cs.LG · May 6, 2026

TL;DR

This paper introduces Cascade Token Selection, a method that accelerates transformer attention by propagating the representative token set from each layer to the next instead of reselecting it from scratch, reducing the cost of the selection step while maintaining accuracy.

Contribution

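It proposes a cascade mechanism that inherits the representative token set from layer $l$ to layer $l + 1$ and updates it with a small number of additions and removals. This lowers the per-layer selection cost from $O(T^{2} d)$, quadratic in the token count $T$, to $O(T r d)$, linear in $T$ for a fixed number of representatives $r$. The mechanism is validated on three model families: GPT-2 124M, GPT-J 6B, and OPT 6.7B.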
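As a rough illustration of the quadratic-to-linear claim, the sketch below counts multiply-adds for a full $T \times T$ Gram matrix against a $(T - r) \times r$ cross-Gram. The function name, the example sizes, and the reading of $d$ as the per-token feature dimension are assumptions made for illustration. The count ignores the additions and removals that follow validation, so it is an idealized ratio and not the savings measured in the paper.

```python
def gram_cost_ratio(T: int, r: int, d: int) -> float:
    """Ratio of cross-Gram to full-Gram multiply-adds (idealized count)."""
    full = T * T * d          # T x T Gram over d-dimensional activations
    cross = (T - r) * r * d   # (T - r) x r cross-Gram against the inherited set
    return cross / full


# Hypothetical sizes, not taken from the paper.
ratio = gram_cost_ratio(T=1024, r=256, d=64)
print(f"cross-Gram / full Gram = {ratio:.4f}")
```

The ratio depends only on $r / T$, since $d$ cancels; it shrinks as the representative set becomes a smaller fraction of the sequence.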

Findings

01

Gram operation savings of 22% to 63% across models.

02

Informative tokens are highly consistent across layers, with Jaccard overlap of 0.83 to 0.94.

03

The per-layer cost of the selection step drops from $O(T^{2} d)$ to $O(T r d)$, reducing the overall cost of attention.
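The Jaccard overlap cited in the findings measures how much two index sets share: the size of their intersection divided by the size of their union. A minimal sketch follows; the function name and the example index sets are hypothetical and are not data from the paper.

```python
def jaccard(a, b) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two collections of token indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical representative sets at two layers.
reps_a = [0, 2, 5, 7]
reps_b = [0, 2, 5, 9]
print(jaccard(reps_a, reps_b))  # 3 shared indices out of 5 distinct ones
```

An overlap near 1 means the set selected at one layer is almost entirely reusable at the other, which is the property the cascade relies on.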

Abstract

A method is presented for reducing the cost of representative token selection in transformer attention layers by exploiting the coherence of the representative set across depth. Activation Decorrelation Attention (ADA) selects $r \ll T$ representative tokens at each layer via a Gram threshold and computes attention on the compressed $r \times r$ problem, but the selection requires a $T \times T$ Gram matrix at every layer. The cascade mechanism introduced here inherits the representative set from layer $l$ to layer $l + 1$, validates it via a $(T - r) \times r$ cross-Gram computation, and updates it with a small number of additions and removals. The cost of the selection step drops from $O(T^{2} d)$ to $O(T r d)$ per layer. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates Gram operation savings of $22\%$ to $63\%$ with mean Jaccard overlap of…
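The inherit, validate, and update steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact Gram threshold rule, so the sketch assumes cosine similarity with a hypothetical threshold `tau`, adds every non-representative token whose best match falls below `tau`, and removes a representative when it duplicates an earlier retained one. The function names and the toy activations are also hypothetical.

```python
import numpy as np


def normalize_rows(X: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def cascade_update(X: np.ndarray, inherited, tau: float = 0.9):
    """Validate an inherited representative set against new activations.

    X         : (T, d) activations at layer l + 1
    inherited : token indices selected at layer l (must be non-empty)
    tau       : assumed similarity threshold (hypothetical parameter)
    Returns (new_set, added, removed) as sorted index arrays.
    """
    reps = np.array(sorted(set(inherited)), dtype=int)
    if reps.size == 0:
        raise ValueError("inherited set must be non-empty")
    Xn = normalize_rows(X)

    mask = np.ones(X.shape[0], dtype=bool)
    mask[reps] = False
    others = np.flatnonzero(mask)

    # (T - r) x r cross-Gram: the only computation that scales with T.
    cross = Xn[others] @ Xn[reps].T
    covered = cross.max(axis=1) >= tau
    added = others[~covered]  # assumed rule: add every uncovered token

    # Assumed removal rule: drop a representative that duplicates an
    # earlier retained one (uses only the small r x r Gram).
    gram_rr = Xn[reps] @ Xn[reps].T
    keep = []
    for i in range(reps.size):
        if all(gram_rr[i, j] < tau for j in keep):
            keep.append(i)
    kept = reps[keep]
    removed = np.setdiff1d(reps, kept)

    return np.union1d(kept, added), added, removed


# Toy activations: tokens 0, 1, 5 share a direction; 2, 3 share another; 4 is new.
X = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])
new_set, added, removed = cascade_update(X, inherited=[0, 2, 5], tau=0.9)
print(new_set, added, removed)
```

In the toy example, token 4 is not explained by any inherited representative and is added, while token 5 has become redundant with token 0 and is removed. The sketch never forms a $T \times T$ matrix; the validation step touches only the $(T - r) \times r$ cross-Gram and the $r \times r$ block.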
