Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Qingchuan Ma; Yuhang Wu; Xiawu Zheng; Rongrong Ji

arXiv:2505.23833·cs.CL·June 2, 2025

TL;DR

This paper introduces a theoretically grounded benchmark with novel metrics to evaluate and analyze the abstract reasoning capabilities of large language models, revealing significant limitations and areas for improvement.

Contribution

It develops a mathematical framework and two new metrics, Γ and Δ, for distinguishing genuine abstraction from memorization in LLMs, along with a benchmark built on systematic symbol remapping.

Findings

01

LLMs show limitations in non-decimal arithmetic and symbolic reasoning

02

Persistent gaps in abstraction despite chain-of-thought prompting

03

Γ and Δ effectively measure memory dependence and pattern recognition

Abstract

In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: \(\Gamma\) measures basic reasoning accuracy, while \(\Delta\) quantifies a model's reliance on specific symbols rather than underlying patterns, a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark based on systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching.…
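The symbol-remapping idea in the abstract can be illustrated with a small sketch. It is not taken from the authors' repository, and every name in it (`make_remap`, `remap_text`, `evaluate`, the toy solvers) is hypothetical. It scores a solver on a rule-based task (decimal addition) before and after the digits are replaced by arbitrary symbols. The accuracy gap it reports is only an illustrative stand-in for symbol reliance; the paper's actual definitions of Γ and Δ are not given in this excerpt and may differ.

```python
import random


def make_remap(symbols, replacements, seed=0):
    """Build a random bijection from original symbols to replacement symbols."""
    if len(symbols) != len(replacements):
        raise ValueError("symbols and replacements must have the same length")
    rng = random.Random(seed)
    shuffled = list(replacements)
    rng.shuffle(shuffled)
    return dict(zip(symbols, shuffled))


def remap_text(text, mapping):
    """Rewrite text symbol by symbol; unmapped characters (e.g. '+') pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)


def accuracy(predict, items, mapping):
    """Fraction of (question, answer) pairs the solver gets exactly right."""
    return sum(predict(q, mapping) == a for q, a in items) / len(items)


def evaluate(predict, items, mapping):
    """Return (original accuracy, remapped accuracy, gap).

    The gap is an assumed proxy for symbol reliance, not the paper's metric.
    """
    identity = {s: s for s in mapping}
    remapped = [(remap_text(q, mapping), remap_text(a, mapping)) for q, a in items]
    acc_orig = accuracy(predict, items, identity)
    acc_remap = accuracy(predict, remapped, mapping)
    return acc_orig, acc_remap, acc_orig - acc_remap


def rule_solver(question, mapping):
    """Toy 'abstracting' solver: decodes symbols, applies the rule, re-encodes."""
    inverse = {v: k for k, v in mapping.items()}
    left, right = remap_text(question, inverse).split("+")
    return remap_text(str(int(left) + int(right)), mapping)


def make_lookup_solver(seen_items):
    """Toy 'memorizing' solver: answers only questions it has seen verbatim."""
    table = dict(seen_items)
    return lambda question, mapping: table.get(question, "")
```

On a toy item set such as `[("12+7", "19"), ("3+4", "7"), ("25+30", "55")]` with digits remapped to the letters `A` through `J`, the rule-following solver keeps its accuracy under remapping, while the lookup solver answers every original item and none of the remapped ones. In this sketch the mapping is passed to the solver directly, which stands in for whatever way a real evaluation would communicate the new symbols to the model.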

Code & Models

Repositories

mac-automl/abstract-reason-benchmark
PyTorch · Official

Videos

Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective · SlidesLive

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics Education and Teaching Techniques · Teaching and Learning Programming