Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Yang Shanglin

TL;DR
This paper investigates why training-free token reduction methods for Vision Transformers fail at high compression, revealing inherent instability in pairwise similarity signals and proposing a diagnostic framework and a new method, CATIS, to improve stability and performance.
Contribution
The paper introduces a diagnostic framework built on two tools, ranking consistency and off-diagonal correlation, to identify the causes of collapse, and proposes CATIS, a method based on unary signals, to make token reduction more stable.
Findings
Pairwise similarity signals degrade significantly in deep layers.
Unary signals are more stable than pairwise signals due to lower perturbation sensitivity.
CATIS achieves near-original accuracy at 63% FLOPs reduction, outperforming baselines.
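The stability claim in the findings above can be probed with a small numerical experiment. The sketch below is illustrative only and is not the paper's protocol: it assumes ranking consistency can be approximated by the Spearman rank correlation between token rankings before and after a Gaussian perturbation, and it uses two hypothetical stand-in signals (token L2 norm as a unary score, maximum cosine similarity to another token as a pairwise score). The function names, the noise model, and the random token features are all assumptions.

```python
import numpy as np

def ranks(x):
    """Return the rank (0 = smallest) of each entry; assumes no ties."""
    order = np.argsort(x)
    r = np.empty_like(order)
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman rank correlation between two score vectors."""
    ra, rb = ranks(a).astype(float), ranks(b).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def unary_score(tokens):
    # Hypothetical unary signal: each token is scored on its own (L2 norm).
    return np.linalg.norm(tokens, axis=1)

def pairwise_score(tokens):
    # Hypothetical pairwise signal: highest cosine similarity to any other token.
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
    return sim.max(axis=1)

def ranking_consistency(score_fn, tokens, noise_std, rng, trials=20):
    """Mean Spearman correlation between clean and perturbed rankings."""
    clean = score_fn(tokens)
    vals = []
    for _ in range(trials):
        noisy = tokens + rng.normal(0.0, noise_std, size=tokens.shape)
        vals.append(spearman(clean, score_fn(noisy)))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))  # synthetic stand-in for ViT token features
for name, fn in [("unary", unary_score), ("pairwise", pairwise_score)]:
    print(name, round(ranking_consistency(fn, tokens, 0.1, rng), 3))
```

Running this on real intermediate ViT features, layer by layer, would be closer in spirit to the diagnosis described here; the synthetic Gaussian tokens only show how such a measurement could be set up, not what the paper measured.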
Abstract
Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency and off-diagonal correlation, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves; and (2) shared reliance on pairwise similarity signals whose ranking consistency degrades in deep layers. Pairwise rankings are inherently unstable (joint perturbations), while unary signals enjoy greater stability (perturbations, CLT). From three design principles derived from this diagnosis, we construct…
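The abstract names off-diagonal correlation as a diagnostic tool, but this page does not define it. As one plausible reading, and purely as an assumption, the sketch below computes the Pearson correlation between the off-diagonal entries of two token-similarity matrices, for example the matrices obtained before and after perturbing the tokens. The helper names and the cosine-similarity choice are hypothetical and may differ from the paper's definition.

```python
import numpy as np

def cosine_similarity_matrix(tokens):
    """N x N cosine similarity between token feature vectors."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return unit @ unit.T

def off_diagonal_correlation(sim_a, sim_b):
    """Pearson correlation over the off-diagonal entries of two N x N matrices.

    Assumed definition for illustration; the diagonal is excluded because
    self-similarity is always 1 and carries no ranking information.
    """
    mask = ~np.eye(sim_a.shape[0], dtype=bool)
    return float(np.corrcoef(sim_a[mask], sim_b[mask])[0, 1])

rng = np.random.default_rng(0)
tokens = rng.normal(size=(49, 32))  # synthetic stand-in for token features
clean = cosine_similarity_matrix(tokens)
for noise_std in (0.05, 0.2, 0.8):
    perturbed = tokens + rng.normal(0.0, noise_std, size=tokens.shape)
    noisy = cosine_similarity_matrix(perturbed)
    print(noise_std, round(off_diagonal_correlation(clean, noisy), 3))
```

A value near 1 would indicate that the pairwise similarity structure survives the perturbation, while lower values would indicate that the signal a pairwise scoring method relies on has changed.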
