Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective
Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji

TL;DR
This paper introduces a theoretically grounded benchmark with novel metrics to evaluate and analyze the abstract reasoning capabilities of large language models, revealing significant limitations and areas for improvement.
Contribution
It develops a mathematical framework and two new metrics, Γ and Δ, for assessing genuine abstraction versus memorization in LLMs, along with a systematic symbol-remapping benchmark.
Findings
LLMs show limitations in non-decimal arithmetic and symbolic reasoning
Persistent gaps in abstraction despite chain-of-thought prompting
Δ effectively measures memory dependence and pattern recognition
Abstract
In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: \(\Gamma\) measures basic reasoning accuracy, while \(\Delta\) quantifies a model's reliance on specific symbols rather than underlying patterns, a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark based on systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching.…
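The symbol-remapping idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy addition task, the helper names (`make_mapping`, `remap`, `accuracy`, `lookup_solver`), and the accuracy gap computed at the end, which is only a rough stand-in for symbol dependence and is not the paper's definition of Δ.

```python
import random
import string


def make_mapping(symbols, pool, seed=0):
    """Return a random one-to-one map from each task symbol to a fresh symbol."""
    if set(symbols) & set(pool):
        raise ValueError("replacement pool must not overlap the task symbols")
    rng = random.Random(seed)
    return dict(zip(symbols, rng.sample(list(pool), len(symbols))))


def remap(text, mapping):
    """Rewrite text symbol by symbol; unmapped characters (e.g. '+') are kept."""
    return "".join(mapping.get(ch, ch) for ch in text)


def accuracy(predictions, targets):
    """Fraction of exact-match answers."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


# Toy rule-based task (hypothetical): decimal addition as (question, answer) pairs.
items = [("12+7", "19"), ("30+45", "75"), ("8+8", "16"), ("99+1", "100")]

# Replace every digit with a distinct uppercase letter; the addition rule is unchanged.
digit_map = make_mapping(string.digits, string.ascii_uppercase, seed=0)
remapped_items = [(remap(q, digit_map), remap(a, digit_map)) for q, a in items]

# Hypothetical stand-in for a model that has memorized the original surface forms.
memorized = dict(items)


def lookup_solver(question):
    """Answer only questions seen verbatim; return an empty string otherwise."""
    return memorized.get(question, "")


acc_original = accuracy([lookup_solver(q) for q, _ in items], [a for _, a in items])
acc_remapped = accuracy(
    [lookup_solver(q) for q, _ in remapped_items], [a for _, a in remapped_items]
)

# Illustrative gap only; NOT the paper's definition of Delta.
gap = acc_original - acc_remapped
```

Because the lookup solver depends entirely on the original digits, it answers every original item and none of the remapped ones. A solver that applied the addition rule to whatever symbols it was given would score the same on both sets.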
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics Education and Teaching Techniques · Teaching and Learning Programming
