A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Qingchuan Ma, Yuexiao Ma, Yongkang Xie, Tianyu Xie, Xiawu Zheng, and Rongrong Ji

arXiv:2605.17278·cs.AI·May 19, 2026

TL;DR

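This paper introduces A2RBench, an automated, scalable pipeline for generating benchmarks that evaluate the abstract reasoning abilities of large language models. It targets two weaknesses of existing benchmarks: expensive manual annotation, which limits scale, and the risk of measuring memorization rather than genuine reasoning.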

Contribution

The paper proposes an automated framework that combines task generation, expansion, and verification, with a theoretically grounded cycle-consistency check to ensure task validity.

Findings

01. Current LLMs perform poorly on abstract reasoning tasks compared to humans.

02. LLMs generate less complex 3D tasks, indicating limited understanding of high-dimensional reasoning.

03. Higher input complexity can sometimes simplify reasoning processes in LLMs.

Abstract

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate them, we further establish a theoretical framework and prove that programmatic verification--testing whether the inverse operation perfectly reverses…
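The verification idea in the abstract (checking that an inverse operation reverses the forward one) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the grid representation, the rotation rule, and the names `rotate_cw`, `rotate_ccw`, `transpose`, and `is_cycle_consistent` are hypothetical and do not come from the paper or its repository. The sketch only tests sampled inputs, so it shows the shape of the check rather than the paper's actual procedure or proof.

```python
from typing import Callable, Iterable, List

Grid = List[List[int]]


def rotate_cw(grid: Grid) -> Grid:
    """Hypothetical forward rule: rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid: Grid) -> Grid:
    """Hypothetical inverse rule: rotate the grid 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def transpose(grid: Grid) -> Grid:
    """A deliberately wrong 'inverse', standing in for a hallucinated rule."""
    return [list(row) for row in zip(*grid)]


def is_cycle_consistent(
    forward: Callable[[Grid], Grid],
    inverse: Callable[[Grid], Grid],
    inputs: Iterable[Grid],
) -> bool:
    """Return True if inverse(forward(x)) == x for every sampled input x."""
    return all(inverse(forward(x)) == x for x in inputs)


# Sampled inputs (assumed for this example), including a non-square grid.
samples: List[Grid] = [
    [[1, 2], [3, 4]],
    [[1, 2, 3]],
    [[0, 1], [1, 0], [2, 2]],
]

valid_pair_ok = is_cycle_consistent(rotate_cw, rotate_ccw, samples)
invalid_pair_ok = is_cycle_consistent(rotate_cw, transpose, samples)
```

In this sketch a rule pair would be kept only when the round trip reproduces every sampled input; the `transpose` pair fails the check and would be rejected.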

Code & Models

Repositories

mac-automl/A2Rbench (GitHub)
