Block-Recurrent Dynamics in Vision Transformers

Mozes Jacobs; Thomas Fel; Richard Hakim; Alessandra Brondetta; Demba Ba; T. Andy Keller

arXiv:2512.19941 · cs.CV · March 18, 2026

Open Access

TL;DR

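This paper proposes the Block-Recurrent Hypothesis (BRH) for Vision Transformers: the computation of a trained ViT's original blocks can be accurately rewritten using far fewer distinct blocks applied recurrently. Treating depth as this kind of recurrent flow gives a simpler account of the model's dynamics and a handle for interpretability.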

Contribution

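It introduces Raptor (Recurrent Approximations to Phase-structured TransfORmers), block-recurrent surrogates of pretrained ViTs, to test empirically whether ViTs have block-recurrent structure, and it explores what that structure reveals about their dynamics.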
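
To make the idea of a block-recurrent depth structure concrete, here is a minimal sketch in Python with NumPy. It is not the Raptor implementation: `make_block` builds a hypothetical toy residual map standing in for a Transformer block, and the repeat schedule `[4, 8]` is an arbitrary illustrative split of an unrolled depth of 12 across 2 distinct blocks. The only point is that the unrolled computation has many steps while the number of distinct parameter sets stays small.

```python
import numpy as np

def make_block(dim, rng):
    """Hypothetical toy residual block x + tanh(x W) V (assumption: not a real ViT block)."""
    W = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    V = rng.normal(scale=dim ** -0.5, size=(dim, dim))

    def block(x):
        return x + np.tanh(x @ W) @ V

    return block

def block_recurrent_forward(x, blocks, repeats):
    """Apply blocks[i] repeats[i] times in order; return every intermediate state."""
    if len(blocks) != len(repeats):
        raise ValueError("need one repeat count per block")
    states = [x]
    for block, n in zip(blocks, repeats):
        for _ in range(n):
            x = block(x)  # the same weights are reused at every step of this phase
            states.append(x)
    return states

rng = np.random.default_rng(0)
dim, tokens = 16, 5
blocks = [make_block(dim, rng) for _ in range(2)]  # 2 distinct blocks
repeats = [4, 8]                                   # illustrative schedule, unrolled depth 12
x0 = rng.normal(size=(tokens, dim))

states = block_recurrent_forward(x0, blocks, repeats)
print("unrolled depth:", len(states) - 1)  # unrolled depth: 12
print("distinct blocks:", len(blocks))     # distinct blocks: 2
```

In this sketch the first four steps share one set of weights and the last eight share another, so the depth-wise computation is a two-phase recurrence rather than twelve independent layers.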

Findings

01

Raptor recovers 96% of DINOv2 accuracy with only 2 blocks

02

Recurrent structure correlates with training methods like stochastic depth

03

ViT dynamics show class-dependent convergence and token-specific behaviors

Abstract

As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that…
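
The between-layer similarity analysis mentioned in the abstract can be illustrated with a short Python sketch. The abstract does not say which similarity measure is used, so the choice here (cosine similarity between token states, averaged over tokens) is an assumption, as are the function name `layer_similarity_matrix` and the synthetic two-phase data. The sketch only shows how contiguous phases would appear as blocks of high similarity along the diagonal.

```python
import numpy as np

def layer_similarity_matrix(states):
    """Mean per-token cosine similarity between every pair of layers.

    states: array of shape (num_layers, num_tokens, dim).
    Returns an array of shape (num_layers, num_layers).
    """
    states = np.asarray(states, dtype=float)
    unit = states / np.linalg.norm(states, axis=-1, keepdims=True)
    # Sum of per-token cosines between layer i and layer j, then average over tokens.
    return np.einsum("itd,jtd->ij", unit, unit) / states.shape[1]

# Synthetic data (assumption): layers 0-2 stay near one pattern, layers 3-5 near another.
rng = np.random.default_rng(0)
tokens, dim = 8, 32
phase_a = rng.normal(size=(tokens, dim))
phase_b = rng.normal(size=(tokens, dim))
layers = [phase_a + 0.05 * rng.normal(size=(tokens, dim)) for _ in range(3)]
layers += [phase_b + 0.05 * rng.normal(size=(tokens, dim)) for _ in range(3)]

S = layer_similarity_matrix(np.stack(layers))
print(np.round(S, 2))  # two 3x3 blocks of high similarity along the diagonal
```

With real activations, one would collect the token states after each block of a pretrained ViT and pass them in place of the synthetic `layers`.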

Taxonomy

Topics: Visual perception and processing mechanisms · Advanced Vision and Imaging · Face Recognition and Perception