
TL;DR
The paper hypothesizes that reasoning is a distinct modality, separate from the low-level workspace on which rules are applied, and introduces a role-separated transformer whose reported accuracy on ARC-1 (62.6%) exceeds human performance.
Contribution
The paper introduces a novel role-separated transformer architecture that explicitly models reasoning as a separate modality, achieving state-of-the-art results on the ARC benchmark.
Findings
The model achieves 62.6% accuracy on ARC-1, which the paper reports as surpassing human performance.
The model exhibits a more coherent rule-application structure than the baseline.
These results support the hypothesis that reasoning is a distinct modality.
Abstract
The Abstraction and Reasoning Corpus (ARC) provides a compact laboratory for studying abstract reasoning, an ability central to human intelligence. Modern AI systems, including LLMs and ViTs, largely operate as sequence-of-behavior prediction machines: they match observable behaviors by modeling token statistics without a persistent, readable mental state. This creates a gap with human-like behavior: humans can explain an action by decoding internal state, while AI systems can produce fluent post-hoc rationalizations that are not grounded in such a state. We hypothesize that reasoning is a modality: reasoning should exist as a distinct channel separate from the low-level workspace on which rules are applied. To test this hypothesis, we treat ARC tasks as a visual reasoning problem and design a novel role-separated transformer block that splits global controller tokens from grid…
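The abstract is cut off before it describes the block in detail, so the following is only a minimal sketch of one way "role separation" between controller tokens and grid tokens could look. It is not the authors' architecture. It assumes that each role gets its own query/key/value projections while attention is computed jointly over all tokens. The names `make_role_params` and `role_separated_attention`, the single attention head, and the residual update are all hypothetical choices made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_role_params(d, rng):
    """Hypothetical parameters: a separate Q/K/V matrix set for each role."""
    def random_matrix():
        return [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
                for _ in range(d)]
    return {role: {name: random_matrix() for name in ("q", "k", "v")}
            for role in ("controller", "grid")}


def role_separated_attention(controller, grid, params):
    """Single-head attention in which projections depend on the token's role.

    Assumption (not from the paper): both roles attend over the full
    concatenated sequence, and outputs are added back as a residual.
    """
    tokens = [("controller", x) for x in controller] + [("grid", x) for x in grid]
    d = len(tokens[0][1])
    # Each token is projected with the weights belonging to its own role.
    q = [matvec(params[role]["q"], x) for role, x in tokens]
    k = [matvec(params[role]["k"], x) for role, x in tokens]
    v = [matvec(params[role]["v"], x) for role, x in tokens]

    out = []
    for i, (_, x) in enumerate(tokens):
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        mixed = [sum(weights[j] * v[j][c] for j in range(len(tokens)))
                 for c in range(d)]
        out.append([xi + mi for xi, mi in zip(x, mixed)])  # residual update

    n_controller = len(controller)
    return out[:n_controller], out[n_controller:]


# Toy usage: 2 controller tokens and a flattened 3x3 grid, embedding size 4.
rng = random.Random(0)
d = 4
params = make_role_params(d, rng)
controller = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(2)]
grid = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(9)]
new_controller, new_grid = role_separated_attention(controller, grid, params)
```

In this sketch the controller tokens play the part of the separate reasoning channel and the grid tokens play the part of the workspace. How the paper actually routes information between the two roles is not recoverable from the truncated abstract.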
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices
