Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects
Wenhao Li, Yudong Xu, Scott Sanner, Elias Boutros Khalil

TL;DR
This paper investigates why standard Vision Transformers fail on the ARC benchmark and introduces ViTARC, a modified architecture with added inductive biases that reaches a close to 100% solve rate on over half of the ARC tasks.
Contribution
The paper identifies the representational shortcomings of vanilla ViT for ARC tasks and proposes ViTARC, a tailored architecture with novel encoding schemes that significantly improves reasoning performance.
Findings
Vanilla ViT fails on most ARC tasks even when trained on one million examples per task.
ViTARC achieves close to 100% solve rate on over half of the ARC tasks.
Inductive biases are crucial for effective visual reasoning with transformers.
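The paper's title and the reviews both single out 2D positional information as one of the relevant inductive biases. As a rough illustration of the general idea only, the sketch below builds a 2D sinusoidal positional encoding by concatenating a row encoding and a column encoding for each grid cell. It is not the authors' implementation: the function names `sinusoidal_1d` and `positional_encoding_2d`, the base of 10000, and the half-and-half split of the embedding dimension are all assumptions made for this example.

```python
import math


def sinusoidal_1d(position, dim):
    """Sinusoidal encoding of one integer position as a list of length dim."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(0, dim, 2):
        # Each sin/cos pair uses a different wavelength.
        angle = position / (10000.0 ** (i / dim))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc


def positional_encoding_2d(height, width, dim):
    """Return one vector per grid cell, in row-major order.

    The first dim/2 entries encode the row index and the last dim/2 entries
    encode the column index (an assumed layout, chosen for illustration).
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    table = []
    for r in range(height):
        row_part = sinusoidal_1d(r, half)
        for c in range(width):
            table.append(row_part + sinusoidal_1d(c, half))
    return table


# Usage: a 3x4 grid with 8-dimensional encodings.
pe = positional_encoding_2d(3, 4, 8)
```

With this layout, two cells in the same row share the first half of their vectors and two cells in the same column share the second half, so the grid structure stays explicit after the image is flattened into a token sequence.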
Abstract
The Abstraction and Reasoning Corpus (ARC) is a popular benchmark focused on visual reasoning in the evaluation of Artificial Intelligence systems. In its original framing, an ARC task requires solving a program synthesis problem over small 2D images using a few input-output training pairs. In this work, we adopt the recently popular data-driven approach to the ARC and ask whether a Vision Transformer (ViT) can learn the implicit mapping, from input image to output image, that underlies the task. We show that a ViT -- otherwise a state-of-the-art model for images -- fails dramatically on most ARC tasks even when trained on one million examples per task. This points to an inherent representational deficiency of the ViT architecture that makes it incapable of uncovering the simple structured mappings underlying the ARC tasks. Building on these insights, we propose ViTARC, a ViT-style…
Peer Reviews
Decision·Submitted to ICLR 2025
1. Abstract Visual Reasoning (AVR) tasks are interesting and important to study because they require strong reasoning capability from ViTs. The paper also has very interesting findings, highlighting the importance of positional encodings in solving pure vision-based visual reasoning tasks. 2. This work describes in detail the model improvements the authors tried in order to improve performance, from ViT-Vanilla to ViTARC-VT and ViTARC. This is very useful for the community in reproducing the experiments
1. The training/evaluation protocol is not clearly defined. The paper does not clearly show the generalization ability on unseen tasks. All the tasks used for evaluation have some examples seen during training. It would be very interesting to reserve some tasks purely for evaluation, unseen during training. 2. Some of the key techniques used in this work are not new, like 2D (Relative) Positional Encoding, which have been discussed in the original ViT/Swin Transformer papers and play
1 - The paper addresses an interesting question, which is: if the tasks in ARC are visual tasks, how can we use the current vision tools to deal with them? 2 - The reasoning behind every contribution in the paper is well explained. The paper is easy to follow. 3 - Related to the previous point, I particularly like the analysis in Figure 6. It helps understand what the model is (not) paying attention to. 4 - The results in the paper show a clear improvement with respect to the original ViT v
The paper has some weaknesses that I believe can be addressed, but also should be addressed. **1 - Unclear that this is vision modeling**. The main argument of the paper is that vision transformers should be improved to work on ARC. But the proposed changes make the model not be a vision model anymore. While the paper mentions that "Transformer-based LLM approaches convert the images into strings, which does not fully capture all relevant structural information", this is not too different from
I appreciate that the authors explore the limits of a data-driven approach to ARC, as well as propose potential inductive biases to encode into a reasoning model for ARC. General priors for reasoning tasks are indeed important. The quantitative results compared to a naive ViT are promising for ARC.
1. Many of the architectural designs in the proposed model are made for solving ARC specifically. I believe ARC is a great intermediate proxy task for complex reasoning, but should not be an end goal in and of itself. With enough inductive biases, I believe that solving ARC with a million examples is reasonable, but is not particularly enlightening for the community. For example, 2D padding with <arc_pad> tokens and border tokens <arc_endxgrid> that define grid conditions, etc., are very much def
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
