Mechanisms of Non-Monotonic Scaling in Vision Transformers
Anantha Padmanaban Krishna Kumar (Boston University)

TL;DR
This paper investigates why deeper Vision Transformers sometimes underperform shallower ones. It identifies a three-phase pattern in how their representations evolve with depth and proposes the Information Scrambling Index as a diagnostic tool for understanding and improving the effects of model depth.
Contribution
The paper presents a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, identifies a three-phase Cliff-Plateau-Climb pattern in how representations evolve with depth, and proposes the Information Scrambling Index as a diagnostic measure of information mixing.
Findings
Representations in ViT-S, ViT-B, and ViT-L follow a consistent three-phase Cliff-Plateau-Climb pattern as depth increases.
Better performance correlates with the marginalization of the [CLS] token.
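In ViT-L, the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and these additional layers correlate with increased information diffusion rather than improved task performance.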
Abstract
Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime…
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Face Recognition and Perception
