Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects

Wenhao Li; Yudong Xu; Scott Sanner; Elias Boutros Khalil

arXiv:2410.06405 · cs.CV · July 17, 2025

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper investigates the limitations of standard Vision Transformers on the ARC benchmark and introduces ViTARC, a modified architecture with added inductive biases that achieves close to a 100% solve rate on more than half of the ARC tasks.

Contribution

The paper identifies the representational shortcomings of vanilla ViT for ARC tasks and proposes ViTARC, a tailored architecture with novel encoding schemes that significantly improves reasoning performance.

Findings

01 Vanilla ViT fails on most ARC tasks even with extensive training.

02 ViTARC achieves close to 100% solve rate on over half of the ARC tasks.

03 Inductive biases are crucial for effective visual reasoning with transformers.

Abstract

The Abstraction and Reasoning Corpus (ARC) is a popular benchmark focused on visual reasoning in the evaluation of Artificial Intelligence systems. In its original framing, an ARC task requires solving a program synthesis problem over small 2D images using a few input-output training pairs. In this work, we adopt the recently popular data-driven approach to the ARC and ask whether a Vision Transformer (ViT) can learn the implicit mapping, from input image to output image, that underlies the task. We show that a ViT -- otherwise a state-of-the-art model for images -- fails dramatically on most ARC tasks even when trained on one million examples per task. This points to an inherent representational deficiency of the ViT architecture that makes it incapable of uncovering the simple structured mappings underlying the ARC tasks. Building on these insights, we propose ViTARC, a ViT-style…

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 5

Strengths

1. Abstract Visual Reasoning (AVR) tasks are interesting and important to study because they require strong reasoning capability from ViTs. The paper also has very interesting findings, highlighting the importance of positional encodings in solving pure vision-based visual reasoning tasks.
2. This work describes in detail the model improvements the authors tried, from ViT-Vanilla to ViTARC-VT and ViTARC. This is very useful for the community in reproducing the experiments.

Weaknesses

1. The training/evaluation protocol is not clearly defined. The paper does not clearly show the generalization ability on unseen tasks. All the tasks used for evaluation have some examples seen during training. It would be very interesting to hold out some tasks purely for evaluation, so that they are never seen during training.
2. Some of the key techniques used in this work are not new, like 2D (Relative) Positional Encoding, which have been discussed in the original ViT/Swin Transformer papers and play
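To make the 2D positional encoding mentioned in this review concrete, here is a minimal sketch of a generic absolute 2D sinusoidal encoding, in which half of the embedding dimensions encode the row and the other half encode the column. It is an illustration only: the function names `sinusoidal_1d` and `encode_2d` are hypothetical, the base of 10000 follows the common Transformer convention, and nothing here is taken from the ViTARC code or claims to match the relative variant the reviewer refers to.

```python
import math


def sinusoidal_1d(pos: int, dim: int) -> list[float]:
    """Encode one integer coordinate as `dim` interleaved sin/cos values.

    Assumes `dim` is even. The 10000 base is the usual Transformer
    convention, not a value taken from the paper.
    """
    out = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        out.extend([math.sin(angle), math.cos(angle)])
    return out


def encode_2d(row: int, col: int, dim: int) -> list[float]:
    """Hypothetical 2D encoding: first half encodes the row, second half the column."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return sinusoidal_1d(row, half) + sinusoidal_1d(col, half)


if __name__ == "__main__":
    # Two cells in the same row share the row half of their encoding.
    a = encode_2d(1, 2, 8)
    b = encode_2d(1, 7, 8)
    print(a[:4] == b[:4])
```

Unlike a 1D encoding over the flattened pixel sequence, this layout lets two cells in the same row (or the same column) share half of their positional vector, which is one simple way to expose grid structure to attention.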

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1 - The paper addresses an interesting question: if the tasks in ARC are visual tasks, how can we use the current vision tools to deal with them?
2 - The reasoning behind every contribution in the paper is well explained. The paper is easy to follow.
3 - Related to the previous point, I particularly like the analysis in Figure 6. It helps the reader understand what the model is (not) paying attention to.
4 - The results in the paper show a clear improvement with respect to the original ViT v

Weaknesses

The paper has some weaknesses that I believe can, and should, be addressed.
**1 - Unclear that this is vision modeling**. The main argument of the paper is that vision transformers should be improved to work on ARC. But with the proposed changes, the model is no longer a vision model. While the paper mentions that "Transformer-based LLM approaches convert the images into strings, which does not fully capture all relevant structural information", this is not too different from

Reviewer 03 · Rating 5 · Confidence 4

Strengths

I appreciate that the authors explore the limits of a data-driven approach to ARC, as well as propose potential inductive biases to encode into a reasoning model for ARC. General priors for reasoning tasks are indeed important. The quantitative results compared to a naive ViT are promising for ARC.

Weaknesses

1. Many of the architectural designs in the proposed model are made for solving ARC specifically. I believe ARC is a great intermediate proxy task for complex reasoning, but should not be an end goal in and of itself. With enough inductive biases, I believe that solving ARC with a million examples is reasonable, but is not particularly enlightening for the community. For example, 2D padding with <arc_pad> tokens and border tokens <arc_endxgrid> that define grid conditions, etc., are very much def
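As a rough illustration of the 2D padding this review refers to, the sketch below pads a small variable-size grid to a fixed square canvas, marks the end of each real row with a border token, and fills the rest with pad tokens. Only the token name `<arc_pad>` appears as-is in the review; the border token is written as `<arc_endxgrid>` on the assumption that it is spelled with an underscore like `<arc_pad>`. The function `pad_grid` and the exact placement of the tokens are hypothetical and are not taken from the ViTARC repository.

```python
PAD = "<arc_pad>"        # pad token name as quoted in the review
END_X = "<arc_endxgrid>"  # border token; underscore spelling is an assumption


def pad_grid(grid: list[list[int]], size: int) -> list[list[str]]:
    """Pad an h x w grid of colour ids to a size x size grid of string tokens.

    Hypothetical layout: each real row is followed by one END_X token,
    and every remaining cell is PAD.
    """
    height = len(grid)
    width = len(grid[0]) if grid else 0
    if height > size or width >= size:
        raise ValueError("grid does not fit, or leaves no room for the border token")

    padded = []
    for r in range(size):
        row = []
        for c in range(size):
            if r < height and c < width:
                row.append(str(grid[r][c]))   # real cell
            elif r < height and c == width:
                row.append(END_X)             # right border of a real row
            else:
                row.append(PAD)               # outside the original grid
        padded.append(row)
    return padded


if __name__ == "__main__":
    for row in pad_grid([[1, 2], [3, 4]], 4):
        print(" ".join(row))
```

Keeping the padding two-dimensional means every token stays at a fixed (row, column) location on the canvas regardless of the original grid size, which is the property a 2D positional encoding would rely on.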

Code & Models

Repositories

khalil-research/ViTARC
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings