Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information
Youcheng Huang, Bowen Qin, Chen Huang, Duanyu Feng, Xi Yang, Wenqiang Lei

TL;DR
This paper evaluates large reasoning models' ability to proactively ask for information in incomplete problem scenarios, revealing current limitations and highlighting the need for more intelligent, interactive AI systems.
Contribution
It introduces a new dataset for incomplete problems and systematically assesses LRMs' ability to ask for information, exposing gaps in their proactive reasoning capabilities.
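As a rough illustration of what "incomplete problems" can look like, the sketch below builds the two variants mentioned in the reviews further down (a blanked goal and a removed premise) from a toy word problem. This is a minimal sketch under stated assumptions, not the authors' construction pipeline: the `Problem` class, the `blank_goal`, `remove_premise`, and `render` helpers, and the example problem are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Problem:
    """Hypothetical container for a word problem (not from the paper)."""
    premises: List[str]   # statements giving the known facts
    goal: Optional[str]   # the question being asked; None when blanked


def blank_goal(problem: Problem) -> Problem:
    """Return a copy whose goal is missing, so nothing is being asked."""
    return replace(problem, goal=None)


def remove_premise(problem: Problem, index: int) -> Problem:
    """Return a copy with one premise dropped, so a needed fact is missing."""
    if not 0 <= index < len(problem.premises):
        raise IndexError("premise index out of range")
    kept = problem.premises[:index] + problem.premises[index + 1:]
    return replace(problem, premises=kept)


def render(problem: Problem) -> str:
    """Join the remaining premises and the goal (if any) into prompt text."""
    parts = list(problem.premises)
    if problem.goal is not None:
        parts.append(problem.goal)
    return " ".join(parts)


# Toy example problem (an assumption, chosen only to show the two variants).
example = Problem(
    premises=["A train travels at 60 km/h.", "It travels for 2 hours."],
    goal="How far does it travel?",
)

missing_goal = blank_goal(example)
missing_premise = remove_premise(example, 1)

print(render(missing_goal))
print(render(missing_premise))
```

In both variants a well-behaved model would be expected to ask a clarifying question (what should be computed, or how long the train travels) rather than guess an answer.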
Findings
LRMs struggle to ask for missing information in incomplete problems.
Overthinking and hallucination behaviors are prevalent in LRMs.
Supervised fine-tuning shows potential but faces challenges.
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, this evaluation setup leaves a critical gap, since a genuinely intelligent agent should not only solve problems (as a math quiz solver), but also be able to ask for information when the problems lack sufficient information, enabling proactivity in responding to users' requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover the behaviors related to overthinking and hallucination of LRMs, and highlight the potential and challenges of supervised fine-tuning in learning such ability. We hope to…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* I believe that the premise of this work - training of models being heavily geared towards problem solving, overlooking other important aspects such as information seeking / question generation - is important, and the work is a good step in addressing that.
* The paper's analysis of results is detailed and elaborate.
* Interesting analysis in Figure 7, which potentially shows that the clarification questions decrease with decrease in model confidence.
While the problem studied in the manuscript is important, I have concerns about the experimental setup.
* The majority of the evaluation considered in this paper seems to rely on the LLM-as-a-judge setup, and I am not convinced of its reliability given the open-ended nature of the tasks (e.g., evaluation where the response includes a clarification request is more complex than simply assessing the equivalence of two expressions). The authors provide a Human Evaluation confirm
- Clear motivation. Convincingly argues that genuine mathematical intelligence requires proactively asking for missing information, not just solving well-posed problems, and anchors this in concrete real-world examples and prior perspectives on intelligence.
- The paper introduces a clear definition of incomplete mathematical problems.
- Insightful fine-grained analysis that goes beyond averages to examine thoughts, reflection steps, and noticing vs. acting on incompleteness—surfacing “thought
- The dataset is largely LLM-synthesized (DeepSeek R1 and Gemini 2.5) via templated disturbances (blanking the goal or removing a premise), potentially creating somewhat "artificial" patterns rather than simulating how real users naturally phrase underspecified or ambiguous math questions. Human spot checks cannot completely solve this.
- The key metrics depend on LLM-as-a-judge (primarily DeepSeek R1) to detect “clarification,” even while R1 is an evaluated model. This creates severe circular
- Problem framing is timely. Asking for clarification on incomplete inputs is useful, and the paper makes this failure mode concrete for math. The coarse and fine-grained metrics (CR/ACC, TLC/TLNC, RS/ROR/CNR) are clearly defined.
- Clear empirical signal. Under “implicit prompts,” clarification ratios hover around 25–35% and only rise to roughly 50–60% with explicit instructions, which is a crisp takeaway.
- Readable, well-organized paper. The data construction pipeline is explained step-by-step, a
- Benchmark necessity vs. saturation. We already have substantial math benchmarks (e.g., MATH (12.5k), GSM8K (8.5k), Omni-MATH (4.4k Olympiad-level)) and even multimodal math benchmarks like MathVista targeting visual reasoning. The paper’s case for “another” math dataset is not fully persuasive. Its novelty is primarily removal of goals/premises plus evaluation of question-asking. Related efforts like QuestBench study information acquisition explicitly (albeit with multiple-choice question sele
1. The paper is clearly expressed, with well-defined motivations and solutions, making it easy to read.
2. The paper proposes a new benchmark, which may be of certain help to subsequent work in this field.
1. Current large language models are not solely evaluated on solving well-defined mathematical problems. Currently, there are numerous benchmarks and discussions regarding ill-defined, pathological, and unsolvable mathematical problems, and the current difficulty lies in solving such problems.
2. The conclusion of the paper is not novel. The inadequacy of large language models in proactively asking for information has already been widely discussed in the field. So this is hardly a new finding.
3
