LoopViT: Scaling Visual ARC with Looped Transformers
Wen-Jie Shu, Xuerui Qiu, Rui-Jie Zhu, Harold Haodong Chen, Yexin Liu, Harry Yang

TL;DR
LoopViT introduces a recursive vision transformer architecture with weight-tied recurrence and a dynamic exit mechanism, enabling more efficient and scalable visual reasoning that outperforms larger models on the ARC-AGI benchmark.
Contribution
It proposes LoopViT, a novel recursive transformer with weight tying and a dynamic halting mechanism, decoupling reasoning depth from model capacity for improved efficiency.
Findings
LoopViT achieves 65.8% accuracy on ARC-AGI-1 with only 18M parameters.
It outperforms larger 73M-parameter ensembles on the same benchmark.
Adaptive iterative computation proves more scalable than increasing network width.
Abstract
Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state "crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy,…
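The idea of looping one shared block and halting on low predictive entropy can be illustrated with a small toy sketch. This is not the authors' implementation: the Hybrid Block is replaced by a hypothetical stand-in (`toy_block`, which simply sharpens logits with a fixed gain), and the names `run_with_dynamic_exit`, `threshold`, and `max_steps` are assumptions made for illustration. The abstract does not specify the exact exit criterion, so the rule below (mean per-cell entropy under a fixed threshold) is only one plausible reading.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_entropy(logit_rows):
    """Average Shannon entropy (in nats) of the per-cell predictive distributions."""
    total = 0.0
    for row in logit_rows:
        probs = softmax(row)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(logit_rows)


def toy_block(logit_rows, gain=1.5):
    """Hypothetical stand-in for the shared block.

    The same 'weights' (here a single gain) are reused at every iteration,
    which is what weight tying means in this sketch.
    """
    return [[gain * x for x in row] for row in logit_rows]


def run_with_dynamic_exit(logit_rows, block, threshold=0.05, max_steps=20):
    """Apply one weight-tied block repeatedly.

    Stops as soon as the mean predictive entropy falls below `threshold`,
    or after `max_steps` iterations. The exit rule adds no learned parameters.
    """
    steps = 0
    while steps < max_steps and mean_entropy(logit_rows) >= threshold:
        logit_rows = block(logit_rows)
        steps += 1
    return logit_rows, steps


if __name__ == "__main__":
    ambiguous = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]
    confident = [[10.0, 0.0, 0.0]]
    _, hard_steps = run_with_dynamic_exit(ambiguous, toy_block)
    _, easy_steps = run_with_dynamic_exit(confident, toy_block)
    print("ambiguous input steps:", hard_steps)
    print("confident input steps:", easy_steps)
```

In this toy setting the number of iterations depends on the input rather than on the parameter count: an already confident prediction exits immediately, an ambiguous one loops several times, and a perfectly uniform one runs until `max_steps`.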
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Explainable Artificial Intelligence (XAI)
