Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation
Yue Yang, MingKang Chen, Qihua Liu, Mengkang Hu, Qiguang Chen, Gengrui Zhang, Shuyue Hu, Guangtao Zhai, Yu Qiao, Yu Wang, Wenqi Shao, Ping Luo

TL;DR
This paper introduces DRE-Bench, a hierarchical, dynamic reasoning benchmark designed to evaluate the fluid intelligence of large language models, revealing current models' limitations in high-level abstract reasoning and generalization.
Contribution
The paper presents DRE-Bench, a novel, interpretable benchmark with dynamic variants for assessing fluid intelligence in LLMs, addressing limitations of existing reasoning tests.
Findings
LLMs perform well on low-level reasoning tasks
Models struggle with high-level cognition and generalization
Current LLMs exhibit limited fluid intelligence compared to humans
Abstract
Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including…
Peer Reviews
Decision: Submitted to ICLR 2026
1. The task design is well aligned with cognitive psychology, clearly reflecting different types of reasoning and the varying levels of intelligence required for each task.
2. The proposed benchmark is a valuable contribution to the evaluation community, offering diverse and controllable challenging tasks.
3. The paper provides a thorough evaluation and analysis across multiple LLMs, showing how accuracy and stability vary with task complexity and cognitive level.
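The third point above refers to accuracy and stability across task complexity. The following is a minimal sketch of how such a summary could be computed. The exact metrics used in the paper are not given in this excerpt, so the definition of stability here (population standard deviation of per-complexity accuracy) is an assumption, and the outcomes in `example` are made-up placeholders, not results from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical layout: complexity level -> per-variant pass/fail outcomes.
Outcomes = Dict[int, List[bool]]


def summarize_task(outcomes: Outcomes) -> Dict[str, object]:
    """Summarize one task's results across its dynamic variants."""
    # Accuracy at each complexity level (fraction of variants solved).
    per_level = {level: mean(1.0 if ok else 0.0 for ok in runs)
                 for level, runs in outcomes.items()}
    # Overall accuracy over every variant, regardless of level.
    all_runs = [ok for runs in outcomes.values() for ok in runs]
    overall = mean(1.0 if ok else 0.0 for ok in all_runs)
    # Assumed stability proxy: spread of accuracy across complexity levels.
    # A smaller spread means performance changes less as complexity grows.
    spread = pstdev(per_level.values())
    return {"per_level": per_level, "overall": overall, "spread": spread}


# Placeholder outcomes for illustration only.
example: Outcomes = {
    1: [True, True, True, True],
    2: [True, True, False, True],
    3: [True, False, False, False],
}
summary = summarize_task(example)
```

Reporting the per-level accuracies alongside the spread keeps the summary interpretable: a model can have a reasonable overall accuracy while degrading sharply at higher complexity.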
While the paper proposes a cognitively inspired hierarchy of reasoning tasks, it does not measure the time humans take to solve them. Given the small-scale human evaluation and potential variance, reporting the average solving time per task would better substantiate the claimed cognitive alignment and reveal the true difficulty gradient.
- It is an interesting finding that the models achieve higher and more consistent accuracy in vertical (up/down) directions than in horizontal (left/right) ones in the Move task. Similarly, in symmetry tasks, performance is better for horizontal symmetry than for vertical symmetry.
- The authors implement a verifiable, scalable data engine capable of generating diverse reasoning tasks with controllable complexity, an important methodological contribution that ensures reproducibility and adaptability.
- Th
- Several aspects of the experimental design and presentation require clarification and stronger consistency. The selection of models across figures lacks transparency and methodological coherence. For example, Figure 5 does not specify why those particular four models were chosen, Figure 6 focuses solely on DeepSeek-R1 without justification, and Figure 7 switches to yet another subset of models while testing only two tasks. Such inconsistent model selection makes it difficult to assess wheth
Proposes an abstract reasoning framework based on cognitive levels. Designs verifiable code generators and solvers to ensure data quality and scalability.
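To make the idea of a verifiable generator paired with a solver concrete, here is a minimal sketch for a single-object "move" rule on a grid. It is not the authors' data engine: the function names, the grid encoding, and the `OFFSETS` table are all hypothetical, chosen only to illustrate how dynamic variants of one latent rule can be generated and checked automatically.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]

# Hypothetical direction table; the benchmark's real encoding is not shown here.
OFFSETS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def generate_move_variant(size: int, direction: str, steps: int,
                          seed: int) -> Tuple[Grid, Grid]:
    """Build one (input, expected output) pair for the assumed move rule."""
    if not 0 < steps < size:
        raise ValueError("steps must be between 1 and size - 1")
    dr, dc = OFFSETS[direction]
    rng = random.Random(seed)  # seeded so every variant is reproducible
    # Only pick start cells from which the object stays inside the grid.
    rows = [r for r in range(size) if 0 <= r + dr * steps < size]
    cols = [c for c in range(size) if 0 <= c + dc * steps < size]
    r, c = rng.choice(rows), rng.choice(cols)
    grid_in = [[0] * size for _ in range(size)]
    grid_out = [[0] * size for _ in range(size)]
    grid_in[r][c] = 1
    grid_out[r + dr * steps][c + dc * steps] = 1
    return grid_in, grid_out


def solve_move(grid: Grid, direction: str, steps: int) -> Grid:
    """Reference solver: shift every non-zero cell, dropping cells that leave the grid."""
    size = len(grid)
    dr, dc = OFFSETS[direction]
    out = [[0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            nr, nc = r + dr * steps, c + dc * steps
            if grid[r][c] and 0 <= nr < size and 0 <= nc < size:
                out[nr][nc] = grid[r][c]
    return out


def verify_variant(size: int, direction: str, steps: int, seed: int) -> bool:
    """Check that the solver reproduces the generator's expected output."""
    grid_in, grid_out = generate_move_variant(size, direction, steps, seed)
    return solve_move(grid_in, direction, steps) == grid_out
```

Varying `size`, `direction`, `steps`, and `seed` yields many instances of the same underlying rule, and `verify_variant` gives an automatic consistency check on each generated instance.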
Inadequate Human Experiment Design: Alignment with human cognition is the core advantage of this benchmark. However, the paper handles this critical aspect weakly. Although a human comparison experiment is provided, there is a lack of detailed age distribution and significance testing. Additionally, there is no explanation of how participants were motivated to complete the questionnaire seriously or how invalid responses were excluded. For the complex task of designing cognitiv
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · AI-based Problem Solving and Planning
