bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
Michal Byra, Pawel Olszowiec, Grzegorz Stefanski, Grzegorz Gruszczynski, Alberto Presta

TL;DR
This paper introduces bViT, a recurrent vision transformer that applies a single block repeatedly. It achieves accuracy comparable to standard ViTs with fewer parameters and offers insight into the roles of depth and recurrence in vision models.
Contribution
The study demonstrates that a single recurrent block can replace the stack of independently parameterized blocks in a ViT, especially when the model is sufficiently wide, offering a parameter-efficient alternative.
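To make the parameter-efficiency claim concrete, the following is a rough back-of-the-envelope sketch rather than the authors' accounting. It assumes a conventional ViT-B block (width 768, 12 blocks, MLP ratio 4, biases on all linear layers) and counts only block parameters, ignoring the patch embedding, positional embeddings, class token, and classification head. The helper name `block_params` is hypothetical.

```python
def block_params(d, mlp_ratio=4):
    """Parameter count of one standard pre-norm transformer block (assumed layout)."""
    attn = 4 * d * d + 4 * d                    # Q, K, V, and output projections, with biases
    hidden = mlp_ratio * d
    mlp = d * hidden + hidden + hidden * d + d  # two linear layers, with biases
    norms = 2 * 2 * d                           # two LayerNorms, scale and shift each
    return attn + mlp + norms

# Assumed ViT-B dimensions; the paper's exact bViT-B configuration may differ.
WIDTH, DEPTH = 768, 12

per_block = block_params(WIDTH)
stacked = DEPTH * per_block   # 12 independently parameterized blocks
shared = per_block            # one block reused for all 12 steps

print(f"per block: {per_block:,}")
print(f"stacked:   {stacked:,}")
print(f"shared:    {shared:,}")
print(f"ratio:     {stacked // shared}x")
```

Under these assumptions the shared block holds about 7.1M parameters against roughly 85M for the 12-block stack. The ratio for a full model would be somewhat smaller, because the embedding and head parameters are present in both variants.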
Findings
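A 12-step bViT-B achieves accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.
Wider bViTs recover more of the standard ViT's performance than narrow variants.
The shared block's behavior evolves across recurrent steps rather than simply repeating the same operation.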
Abstract
Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer-specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer-specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than…
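The difference between a stacked model and a single-block recurrent one can be illustrated with a toy sketch. This is not the paper's implementation: `ToyBlock` is a hypothetical stand-in that performs a residual tanh-linear update on a small vector, where a real ViT block would use self-attention and an MLP over patch tokens. The point is only the weight sharing, where the recurrent variant reuses one block object for every step.

```python
import math
import random


class ToyBlock:
    """Hypothetical stand-in for a transformer block: x -> x + tanh(W x)."""

    def __init__(self, width, seed):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(width)
        self.weight = [[rng.gauss(0.0, scale) for _ in range(width)]
                       for _ in range(width)]

    def __call__(self, x):
        # Residual update, mirroring the skip connections in a ViT block.
        return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
                for xi, row in zip(x, self.weight)]

    def num_params(self):
        return len(self.weight) * len(self.weight[0])


def forward(blocks, x):
    """Apply blocks in order and return the state after every step."""
    states = [x]
    for block in blocks:
        x = block(x)
        states.append(x)
    return states


def unique_params(blocks):
    """Count parameters once per distinct block object."""
    distinct = {id(b): b for b in blocks}
    return sum(b.num_params() for b in distinct.values())


WIDTH, STEPS = 8, 12  # toy sizes, chosen only for illustration

# Stacked: 12 independently parameterized blocks.
stacked = [ToyBlock(WIDTH, seed=i) for i in range(STEPS)]

# Recurrent: one block applied 12 times.
shared = ToyBlock(WIDTH, seed=0)
recurrent = [shared] * STEPS

x0 = [0.1 * i for i in range(WIDTH)]
stacked_states = forward(stacked, x0)
recurrent_states = forward(recurrent, x0)

print("stacked params:  ", unique_params(stacked))
print("recurrent params:", unique_params(recurrent))
```

Both variants perform the same number of block applications, so the compute per forward pass is the same; only the number of distinct parameters changes. The intermediate states in `recurrent_states` generally differ from step to step even though the weights are identical, because each step receives a different input.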
