Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

Yue Yang; MingKang Chen; Qihua Liu; Mengkang Hu; Qiguang Chen; Gengrui Zhang; Shuyue Hu; Guangtao Zhai; Yu Qiao; Yu Wang; Wenqi Shao; Ping Luo

arXiv:2506.02648·cs.AI·September 30, 2025



Open Access · 3 Reviews

TL;DR

This paper introduces DRE-Bench, a hierarchical, dynamic reasoning benchmark designed to evaluate the fluid intelligence of large language models, revealing current models' limitations in high-level abstract reasoning and generalization.

Contribution

The paper presents DRE-Bench, a novel, interpretable benchmark with dynamic variants for assessing fluid intelligence in LLMs, addressing limitations of existing reasoning tests.

Findings

01

LLMs perform well on low-level reasoning tasks

02

Models struggle with high-level cognition and generalization

03

Current LLMs exhibit limited fluid intelligence compared to humans

Abstract

Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including…
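The idea of "multiple dynamic variants that test the same underlying latent rule" can be illustrated with a minimal sketch. This is not the DRE-Bench data engine: the rule (shift an object to the right), the grid encoding, and every name below (`move_object`, `make_variants`, `accuracy`) are hypothetical and chosen only for illustration. The sketch keeps one latent rule fixed, varies surface properties (grid size, object position), and scores a solver across all variants.

```python
import random
from typing import Callable, List, Tuple

Grid = List[List[int]]


def move_object(grid: Grid, dx: int, dy: int) -> Grid:
    """Hypothetical latent rule: shift every non-zero cell by (dx, dy)."""
    height, width = len(grid), len(grid[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if grid[r][c]:
                out[r + dy][c + dx] = grid[r][c]
    return out


def make_variants(steps: int, sizes: List[int], seed: int) -> List[Tuple[Grid, Grid]]:
    """Build (input, expected) pairs that share one rule: move right by `steps`.

    Only surface features change between variants (grid size, object position).
    Each size must be larger than `steps` so the object stays inside the grid.
    """
    rng = random.Random(seed)
    variants = []
    for size in sizes:
        grid = [[0] * size for _ in range(size)]
        row = rng.randrange(size)
        col = rng.randrange(size - steps)  # leave room for the shift
        grid[row][col] = 1
        variants.append((grid, move_object(grid, dx=steps, dy=0)))
    return variants


def accuracy(solver: Callable[[Grid], Grid], variants: List[Tuple[Grid, Grid]]) -> float:
    """Fraction of variants on which the solver reproduces the expected grid."""
    return sum(solver(x) == y for x, y in variants) / len(variants)


variants = make_variants(steps=2, sizes=[4, 6, 8], seed=0)
oracle = lambda g: move_object(g, dx=2, dy=0)  # knows the latent rule
identity = lambda g: g                         # ignores the rule entirely
print(accuracy(oracle, variants), accuracy(identity, variants))
```

A solver that has captured the rule should score the same on every variant, while one that depends on surface features would show accuracy that varies with grid size or position. That spread across variants is the kind of stability signal the abstract's design is meant to expose.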

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The task design is well aligned with cognitive psychology, clearly reflecting different types of reasoning and the varying levels of intelligence required for each task.
2. The proposed benchmark is a valuable contribution to the evaluation community, offering diverse, controllable, and challenging tasks.
3. The paper provides a thorough evaluation and analysis across multiple LLMs, showing how accuracy and stability vary with task complexity and cognitive level.

Weaknesses

While the paper proposes a cognitively inspired hierarchy of reasoning tasks, it does not measure the time humans take to solve them. Given the small-scale human evaluation and potential variance, reporting the average solving time per task would better substantiate the claimed cognitive alignment and reveal the true difficulty gradient.

Reviewer 02 · Rating 2 · Confidence 5

Strengths

- It is an interesting finding that the models achieve higher and more consistent accuracy in vertical (up/down) directions than in horizontal (left/right) ones in the Move task. Similarly, in symmetry tasks, performance is better for horizontal symmetry than for vertical symmetry.
- The authors implement a verifiable, scalable data engine capable of generating diverse reasoning tasks with controllable complexity, an important methodological contribution that ensures reproducibility and adaptability.
- Th

Weaknesses

- Several aspects of the experimental design and presentation require clarification and stronger consistency. The selection of models across figures lacks transparency and methodological coherence. For example, Figure 5 does not specify why those particular four models were chosen, Figure 6 focuses solely on DeepSeek-R1 without justification, and Figure 7 switches to yet another subset of models while testing only two tasks. Such inconsistent model selection makes it difficult to assess wheth

Reviewer 03 · Rating 2 · Confidence 4

Strengths

Proposes an abstract reasoning framework based on cognitive levels. Designs verifiable code generators and solvers to ensure data quality and scalability.

Weaknesses

Inadequate Human Experiment Design: Alignment with human cognition is the core advantage of this benchmark, yet the paper handles this critical aspect weakly. Although a human comparison experiment is provided, it lacks a detailed age distribution and significance testing. Additionally, there is no explanation of how participants were motivated to complete the questionnaire seriously or how invalid responses were excluded. For the complex task of designing cognitiv


Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · AI-based Problem Solving and Planning