Block-Recurrent Dynamics in Vision Transformers
Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller

TL;DR
This paper proposes the Block-Recurrent Hypothesis for Vision Transformers: the depth of a trained ViT can be modeled as a small number of blocks applied recurrently, which makes its layer-to-layer dynamics easier to analyze and interpret.
Contribution
The paper introduces Raptor, a block-recurrent surrogate trained to approximate a pretrained ViT, to test the block-recurrent structure empirically, and it uses these surrogates to study ViT dynamics from an interpretability standpoint.
Findings
Raptor recovers 96% of DINOv2 accuracy with only 2 blocks
Recurrent structure correlates with training methods like stochastic depth
ViT dynamics show class-dependent convergence and token-specific behaviors
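To make the 2-block finding concrete, the sketch below shows what a block-recurrent forward pass looks like: a stack of L layers is replaced by k distinct blocks, each reused for several consecutive steps. This is a toy NumPy illustration, not the Raptor implementation; `make_block`, `block_recurrent_forward`, the tanh residual update, and the 7 + 5 split of 12 layers are all assumptions chosen for illustration.

```python
import numpy as np


def make_block(dim, rng):
    """Return a toy residual block x -> x + tanh(x @ W).

    This stands in for a full ViT block (attention + MLP); it is an
    illustrative assumption, not the architecture used in the paper.
    """
    W = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda x: x + np.tanh(x @ W)


def block_recurrent_forward(x, blocks, repeats):
    """Apply blocks[i] repeats[i] times in a row, reusing its weights."""
    assert len(blocks) == len(repeats)
    for block, n in zip(blocks, repeats):
        for _ in range(n):
            x = block(x)
    return x


rng = np.random.default_rng(0)
dim, num_tokens = 16, 5

blocks = [make_block(dim, rng) for _ in range(2)]  # k = 2 distinct blocks
repeats = [7, 5]                                   # hypothetical split of L = 12 steps

tokens = rng.normal(size=(num_tokens, dim))
out = block_recurrent_forward(tokens, blocks, repeats)
print(out.shape)
```

The effective depth is still `sum(repeats)` steps, so compute per forward pass is unchanged; only the number of distinct parameter sets drops from L to k.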
Abstract
As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original L blocks can be accurately rewritten using only k ≪ L distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that…
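The between-layer similarity matrices mentioned in the abstract can be sketched as follows. This minimal example assumes cosine similarity between flattened per-layer activations, which may not be the metric the authors use; `layer_similarity_matrix` is a hypothetical helper, and the random activations are placeholders for hidden states that would be collected from a real ViT.

```python
import numpy as np


def layer_similarity_matrix(activations):
    """Cosine similarity between every pair of layers.

    activations: list of arrays, one per layer, all with the same shape
    (for example tokens x hidden_dim). Returns a (layers x layers) matrix.
    """
    flat = np.stack([a.ravel() for a in activations])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T


rng = np.random.default_rng(0)

# Placeholder data: 12 layers of (5 tokens x 16 dims) activations.
# With a real model these would be the hidden states after each block.
acts = [rng.normal(size=(5, 16)) for _ in range(12)]

sim = layer_similarity_matrix(acts)
print(sim.shape)
```

In a matrix like this, contiguous phases would show up as bright square blocks along the diagonal, where neighboring layers are far more similar to each other than to layers outside the block.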
Taxonomy
Topics: Visual perception and processing mechanisms · Advanced Vision and Imaging · Face Recognition and Perception
