LoopViT: Scaling Visual ARC with Looped Transformers

Wen-Jie Shu; Xuerui Qiu; Rui-Jie Zhu; Harold Haodong Chen; Yexin Liu; Harry Yang

arXiv:2602.02156·cs.CV·February 3, 2026

Open Access

TL;DR

LoopViT introduces a recursive vision transformer architecture with weight-tied recurrence and a dynamic exit mechanism, enabling more efficient and scalable visual reasoning that outperforms larger models on the ARC-AGI benchmark.

Contribution

The paper proposes LoopViT, a recursive transformer with weight tying and a dynamic halting mechanism that decouples reasoning depth from model capacity for improved efficiency.

Findings

01. LoopViT achieves 65.8% accuracy on ARC-AGI-1 with only 18M parameters.

02. It outperforms larger 73M-parameter ensembles on the same benchmark.

03. Adaptive iterative computation proves more scalable than increasing network width.

Abstract

Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state "crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy,…
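
The looped inference and entropy-based exit described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract excerpt does not give the exact exit criterion, so the mean per-cell Shannon entropy, the `threshold` and `max_steps` values, and the toy `step_fn` and `readout_fn` below are all assumptions chosen for illustration. The only properties taken from the abstract are that the same block is applied at every iteration and that the exit rule adds no learned weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_entropy(logits_per_cell):
    """Average Shannon entropy (in nats) of the per-cell predictions."""
    total = 0.0
    for logits in logits_per_cell:
        probs = softmax(logits)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(logits_per_cell)


def looped_inference(state, step_fn, readout_fn, max_steps=16, threshold=0.05):
    """Apply one shared step_fn repeatedly; exit early once entropy is low.

    step_fn stands in for the weight-tied block (hypothetical here), and
    threshold / max_steps are illustrative values, not from the paper.
    """
    logits = readout_fn(state)
    steps = 0
    for steps in range(1, max_steps + 1):
        state = step_fn(state)        # same function, same weights, every pass
        logits = readout_fn(state)
        if mean_entropy(logits) < threshold:
            break                     # prediction has settled; stop iterating
    return logits, steps


# Toy stand-ins (assumptions): the state is already a grid of class logits,
# and each pass sharpens it by a constant factor.
def toy_step(state):
    return [[1.5 * x for x in cell] for cell in state]


def toy_readout(state):
    return state


initial = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0]]
logits, steps = looped_inference(initial, toy_step, toy_readout)
print("exited after", steps, "steps; mean entropy =", mean_entropy(logits))
```

In this sketch the number of iterations depends on the input rather than on the parameter count: an input whose prediction sharpens quickly exits early, while a harder one runs up to `max_steps`.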

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Explainable Artificial Intelligence (XAI)