bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

Michal Byra; Pawel Olszowiec; Grzegorz Stefanski; Grzegorz Gruszczynski; Alberto Presta

arXiv:2605.10661 · cs.CV · May 12, 2026

TL;DR

This paper introduces bViT, a recurrent Vision Transformer that applies a single transformer block repeatedly. On ImageNet-1K it reaches accuracy comparable to a standard ViT with an order of magnitude fewer parameters, and it offers insight into how depth and recurrence interact in vision models.

Contribution

The study shows that a single recurrently applied block can stand in for the independently parameterized layers of a ViT, particularly when the model is sufficiently wide, giving a parameter-efficient alternative.
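
The difference between stacking and recurrence can be sketched in a few lines. This is only a structural illustration, not the authors' implementation: `make_block`, `stacked_forward`, and `recurrent_forward` are hypothetical names, and the toy residual update stands in for a real attention-plus-MLP transformer block.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_block(scale: float) -> Block:
    """Toy stand-in for a transformer block (assumption: a simple residual
    update x + scale * (x - mean(x)) instead of attention and an MLP)."""
    def block(x: Vector) -> Vector:
        mean = sum(x) / len(x)
        return [xi + scale * (xi - mean) for xi in x]
    return block


def stacked_forward(x: Vector, blocks: List[Block]) -> Vector:
    """Standard ViT pattern: each layer has its own block (own parameters)."""
    for block in blocks:
        x = block(x)
    return x


def recurrent_forward(x: Vector, block: Block, steps: int) -> Vector:
    """Single-block recurrence: the same block is applied `steps` times."""
    for _ in range(steps):
        x = block(x)
    return x


if __name__ == "__main__":
    tokens = [0.5, -1.0, 2.0, 0.25]
    shared = make_block(0.1)
    # A stack whose 12 layers all point at one block is the recurrent model.
    tied = stacked_forward(tokens, [shared] * 12)
    recurrent = recurrent_forward(tokens, shared, steps=12)
    print(tied == recurrent)
```

Seen this way, the recurrent model is a deep stack with all block weights tied, so the number of steps is preserved while the layer-specific parameters are removed.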

Findings

01. bViT achieves accuracy comparable to a standard ViT with an order of magnitude fewer parameters.

02. Wider bViTs recover more of the standard ViT's performance than narrow variants.

03. The shared block's behavior evolves across recurrent steps rather than simply repeating the same computation.

Abstract

Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer-specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer-specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than…
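
The "order of magnitude fewer parameters" claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes the commonly used ViT-B/16 configuration (width 768, MLP ratio 4, 12 blocks, 16x16 patches, 224x224 input, 1000 classes); none of these hyperparameters are stated in the abstract, and the count ignores any bViT-specific components the paper may add. All function names are hypothetical.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layernorm_params(width: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * width


def block_params(width: int, mlp_ratio: int = 4) -> int:
    """One pre-norm transformer block: two LayerNorms, attention, MLP."""
    hidden = width * mlp_ratio
    attention = linear_params(width, 3 * width) + linear_params(width, width)
    mlp = linear_params(width, hidden) + linear_params(hidden, width)
    return 2 * layernorm_params(width) + attention + mlp


def non_block_params(width: int, patch: int = 16, channels: int = 3,
                     image: int = 224, classes: int = 1000) -> int:
    """Parts outside the blocks (assumed identical in both models)."""
    tokens = (image // patch) ** 2 + 1  # patch tokens plus a class token
    patch_embed = linear_params(patch * patch * channels, width)
    class_token = width
    pos_embed = tokens * width
    final_norm = layernorm_params(width)
    head = linear_params(width, classes)
    return patch_embed + class_token + pos_embed + final_norm + head


def stacked_vit_params(width: int, depth: int) -> int:
    """Standard ViT: `depth` independently parameterized blocks."""
    return depth * block_params(width) + non_block_params(width)


def single_block_params(width: int) -> int:
    """Single-block recurrent model: one block, whatever the step count."""
    return block_params(width) + non_block_params(width)


if __name__ == "__main__":
    stacked = stacked_vit_params(768, 12)
    shared = single_block_params(768)
    print(f"stacked ViT-B:  {stacked:,}")
    print(f"single block:   {shared:,}")
    print(f"ratio:          {stacked / shared:.2f}x")
```

Under these assumptions the stacked model has roughly 86.6M parameters and the single-block model roughly 8.6M, a ratio of about 10, which is consistent with the abstract's claim. Compute per forward pass is unchanged because the block still runs 12 times.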
