TL;DR
This paper introduces A2RBench, an automated, scalable benchmark generation pipeline for evaluating the abstract reasoning abilities of large language models, addressing limitations of manual annotation and memorization.
Contribution
It proposes a novel automated framework combining generation, expansion, and verification, supported by a theoretically grounded cycle-consistency check that ensures task validity.
Findings
Current LLMs perform poorly on abstract reasoning tasks compared to humans.
LLMs generate less complex 3D tasks, indicating limited understanding of high-dimensional reasoning.
Higher input complexity can sometimes simplify reasoning processes in LLMs.
Abstract
Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate them, we further establish a theoretical framework and prove that programmatic verification--testing whether the inverse operation perfectly reverses…
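The verification idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the grid representation, the rotation rules, and the helper `is_cycle_consistent` are all hypothetical, chosen only to show what it means to test whether an inverse operation reverses a forward rule on a set of inputs.

```python
from typing import Callable, Iterable, List

Grid = List[List[int]]  # hypothetical task representation: a rectangular grid


def rotate_cw(grid: Grid) -> Grid:
    """Forward rule (illustrative): rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid: Grid) -> Grid:
    """Candidate inverse: rotate the grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def flip_horizontal(grid: Grid) -> Grid:
    """A wrong candidate inverse, standing in for a hallucinated rule."""
    return [row[::-1] for row in grid]


def is_cycle_consistent(
    forward: Callable[[Grid], Grid],
    inverse: Callable[[Grid], Grid],
    inputs: Iterable[Grid],
) -> bool:
    """Return True if inverse(forward(x)) == x for every sampled input x."""
    return all(inverse(forward(x)) == x for x in inputs)


# Sampled inputs; an asymmetric grid is needed to expose a wrong inverse.
samples: List[Grid] = [
    [[1, 2], [3, 4]],
    [[1, 2, 3], [4, 5, 6]],
]

valid_pair = is_cycle_consistent(rotate_cw, rotate_ccw, samples)
invalid_pair = is_cycle_consistent(rotate_cw, flip_horizontal, samples)
```

In this sketch a rule pair is kept only when the round trip reproduces every sampled input exactly. Passing on a finite sample does not prove the pair is correct everywhere, so the choice of inputs matters: a fully symmetric grid would let the wrong inverse slip through.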
