DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance

Ali Khoramfar; Ali Ramezani; Mohammad Mahdi Mohajeri; Mohammad Javad Dousti; Majid Nili Ahmadabadi; Heshaam Faili

arXiv:2505.24532 · cs.CL · March 2, 2026

TL;DR

DeepQuestion is an automated framework that systematically increases the cognitive complexity of datasets to better evaluate LLMs' reasoning abilities in realistic scenarios, revealing significant performance gaps.

Contribution

We introduce DeepQuestion, a novel method for generating challenging, cognitively diverse datasets based on Bloom's taxonomy to evaluate LLMs more effectively.

Findings

01 Performance drops by up to 70% on complex tasks

02 Current benchmarks overestimate reasoning abilities

03 Cognitive diversity is essential for meaningful evaluation

Abstract

While Large Language Models (LLMs) achieve near-human performance on standard benchmarks, their capabilities often fail to generalize to complex, real-world problems. To bridge this gap, we introduce DeepQuestion, a scalable, automated framework that systematically elevates the cognitive complexity of existing datasets. Grounded in Bloom's taxonomy, DeepQuestion generates (1) scenario-based problems that test the application of knowledge in noisy, realistic contexts, and (2) instruction-based prompts that require models to create new questions from a given solution path, assessing synthesis and evaluation skills. Our extensive evaluation across ten leading open-source and proprietary models reveals a stark performance decline, with accuracy dropping by up to 70% as tasks ascend the cognitive hierarchy. These findings underscore that current benchmarks overestimate true reasoning abilities…
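The abstract reports an accuracy drop of "up to 70%" without saying, in this excerpt, whether that figure is an absolute difference in percentage points or a decline relative to the baseline. The sketch below is a minimal illustration of how per-level accuracy and both kinds of drop could be computed. It is not the authors' evaluation code. The level names (`original`, `scenario`, `instruction`), the helper functions, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Compute accuracy per task level.

    `records` is an iterable of (level, is_correct) pairs, where `level`
    is a hypothetical label for the cognitive level of the task.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / totals[level] for level in totals}


def accuracy_drop(baseline, harder):
    """Return (absolute drop, relative drop) between two accuracies.

    The absolute drop is a difference of fractions (percentage points / 100);
    the relative drop is that difference divided by the baseline accuracy.
    """
    absolute = baseline - harder
    relative = absolute / baseline if baseline > 0 else 0.0
    return absolute, relative


# Toy, made-up results for one model: NOT data from the paper.
records = (
    [("original", True)] * 8 + [("original", False)] * 2
    + [("scenario", True)] * 5 + [("scenario", False)] * 5
    + [("instruction", True)] * 2 + [("instruction", False)] * 8
)

acc = accuracy_by_level(records)
abs_drop, rel_drop = accuracy_drop(acc["original"], acc["instruction"])
print(f"accuracy per level: {acc}")
print(f"absolute drop: {abs_drop:.2f}, relative drop: {rel_drop:.2f}")
```

With these toy numbers the two measures differ noticeably, which is why a reported drop is easier to interpret when the definition is stated alongside it.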

Taxonomy

Topics: Collaboration in agile enterprises · Semantic Web and Ontologies · ERP Systems Implementation and Impact