Cascade Token Selection for Transformer Attention Acceleration
Stephen J. Thomas

TL;DR
This paper introduces Cascade Token Selection, a method to accelerate transformer attention by efficiently propagating representative token sets across layers, significantly reducing computational costs while maintaining accuracy.
Contribution
It proposes a cascade mechanism that inherits and updates token sets across layers, decreasing selection complexity from quadratic to linear in token count, validated on multiple large models.
Findings
Gram operation savings of 22% to 63% across models.
High consistency of informative tokens across layers with Jaccard overlap 0.83 to 0.94.
The method reduces attention computation costs significantly.
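The Jaccard overlap quoted above compares the representative sets chosen at two layers. A minimal sketch of that metric follows; the function name `jaccard` and the example index lists are illustrative assumptions, not values from the paper.

```python
def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two collections of token indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical representative sets at two adjacent layers (made-up indices).
layer_a = [0, 3, 7, 12, 20]
layer_b = [0, 3, 7, 12, 21]

# 4 shared indices out of 6 distinct indices overall.
overlap = jaccard(layer_a, layer_b)
```

An overlap close to 1 means the next layer would keep almost the same tokens, which is the property the cascade relies on.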
Abstract
A method is presented for reducing the cost of representative token selection in transformer attention layers by exploiting the coherence of the representative set across depth. Activation Decorrelation Attention (ADA) selects representative tokens at each layer via a Gram threshold and computes attention on the compressed problem, but the selection requires a Gram matrix at every layer. The cascade mechanism introduced here inherits the representative set from one layer to the next, validates it via a cross-Gram computation, and updates it with a small number of additions and removals. The cost of the selection step drops from quadratic to linear in the token count per layer. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates Gram operation savings of 22% to 63% with mean Jaccard overlap of…
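The inherit, validate, and update steps described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the greedy selection rule, the cosine normalization, the threshold name `tau`, and the function names `select_full` and `cascade_update` are all hypothetical choices made here to show how an n-by-k cross-Gram can replace a full n-by-n Gram matrix.

```python
import numpy as np


def normalize_rows(X):
    """Scale each token vector to unit length so Gram entries are cosines."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def select_full(X, tau):
    """Baseline (assumed rule): greedy selection from the full n x n Gram matrix."""
    Xn = normalize_rows(X)
    G = np.abs(Xn @ Xn.T)  # quadratic in the token count n
    reps = []
    for i in range(X.shape[0]):
        # Keep token i only if it is not already explained by a chosen token.
        if all(G[i, j] < tau for j in reps):
            reps.append(i)
    return reps


def cascade_update(X, inherited, tau):
    """Reuse the previous layer's set, validating it with an n x k cross-Gram."""
    Xn = normalize_rows(X)
    reps = []
    # Removals: drop inherited tokens that became redundant at this layer.
    for i in inherited:
        if all(abs(Xn[i] @ Xn[j]) < tau for j in reps):
            reps.append(i)
    # Validation: cross-Gram between all n tokens and the k kept representatives.
    idx = np.asarray(reps, dtype=int)
    C = np.abs(Xn @ Xn[idx].T)  # linear in n for a fixed set size k
    covered = np.max(C, axis=1, initial=0.0) >= tau
    # Additions: tokens that no kept representative covers.
    for i in np.flatnonzero(~covered):
        if all(abs(Xn[i] @ Xn[j]) < tau for j in reps):
            reps.append(int(i))
    return reps


# Synthetic demo: the second "layer" is a small perturbation of the first.
rng = np.random.default_rng(0)
X0 = rng.standard_normal((32, 16))
X1 = X0 + 0.01 * rng.standard_normal((32, 16))
tau = 0.5
reps0 = select_full(X0, tau)
reps1 = cascade_update(X1, reps0, tau)
```

In this sketch, `cascade_update` only forms an n-by-k product plus a few extra dot products for uncovered tokens, so when the representative set changes little between layers the per-layer selection work grows linearly with the token count instead of quadratically.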
