AI-Assisted Generation of Difficult Math Questions
Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Jiatong Yu, Yinghui He, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, Anirudh Goyal

TL;DR
This paper introduces a hybrid LLM-human pipeline for generating diverse, challenging math questions by extracting core skills and combining them to create out-of-distribution problems, resulting in a high-quality dataset called MATH$^2$.
Contribution
It presents a novel framework that leverages LLM metacognition and human-in-the-loop refinement to produce difficult math questions with higher diversity and quality.
Findings
MATH$^2$ questions are more challenging for models than original MATH questions.
Models perform better on MATH when MATH$^2$ questions are used as in-context examples.
A model's success rate on MATH$^2$ is approximately the square of its success rate on MATH.
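The quadratic relationship in the last finding can be written as y ≈ x², where x is a model's accuracy on MATH and y is its accuracy on MATH$^2$. One way to read this, offered here as an interpretation rather than a proven result, is that each MATH$^2$ question needs two skills applied correctly, and if each succeeds independently with probability x, both succeed with probability x². The sketch below computes that estimate; the function name and the sample accuracies are hypothetical and are not figures from the paper.

```python
def predicted_math2_accuracy(math_accuracy: float) -> float:
    """Return the squared-accuracy estimate y = x**2 for a MATH accuracy x in [0, 1]."""
    if not 0.0 <= math_accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    return math_accuracy ** 2


# Hypothetical accuracies, chosen only for illustration (not results from the paper).
for x in (0.5, 0.8):
    y = predicted_math2_accuracy(x)
    print(f"MATH accuracy {x:.2f} -> estimated MATH^2 accuracy {y:.2f}")
```

Under this reading, the gap between two models widens on MATH$^2$: a model at 0.80 on MATH is estimated at about 0.64, while one at 0.50 drops to 0.25.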
Abstract
Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage LLM metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our…
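The skill-pairing step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the skill names, the prompt wording, and both function names are hypothetical, and the actual pipeline also includes LLM-based filtering and human review that are not shown here.

```python
import itertools
import random


def sample_skill_pairs(skills, num_pairs, seed=0):
    """Sample distinct unordered pairs of two different skills, uniformly at random."""
    all_pairs = list(itertools.combinations(sorted(set(skills)), 2))
    if num_pairs > len(all_pairs):
        raise ValueError("not enough distinct skill pairs to sample from")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return rng.sample(all_pairs, num_pairs)


def build_generation_prompt(skill_a, skill_b):
    """Build a question-generation prompt (hypothetical wording, not the paper's)."""
    return (
        "Write one challenging math question whose solution requires both "
        f"of the following skills: '{skill_a}' and '{skill_b}'. "
        "Then give a complete step-by-step solution."
    )


# Hypothetical skill names, for illustration only.
skills = ["modular arithmetic", "area of polygons", "counting principles", "quadratic equations"]
for skill_a, skill_b in sample_skill_pairs(skills, num_pairs=3):
    print(build_generation_prompt(skill_a, skill_b))
```

Sampling from the set of unordered pairs guarantees that the two skills in each prompt differ, which is what the abstract relies on when it calls the resulting questions "out of distribution".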
Peer Reviews
Decision·Submitted to ICLR 2025
An AI-based framework for accelerating data generation for mathematical reasoning, and an interesting observation about the performance dependency between the newly generated MATH$^2$ dataset and the original MATH dataset.
Several metrics are missing that would help in understanding the pipeline (e.g. the number of questions filtered out at each stage of the framework, and sample skills from the skill extraction stage). Besides, it is not measured whether the generated questions faithfully integrate the skills mentioned (it is not hard to imagine that for challenging problems, the number of skills involved would exceed two). The qualifications of the humans involved in screening are not described. If the entire c
1. This work focuses on an AI-assisted framework for generating harder questions that models themselves struggle to solve. This is potentially of interest to the community, as many existing evaluation benchmarks are becoming saturated, often only reflecting model performance in overfitted domains.
2. The proposed framework results in the creation of a more difficult dataset, MATH$^2$, offering a new benchmark for evaluating mathematical reasoning abilities.
3. The content is clear and easy to fo
1. One major weakness is the generalizability of the proposed framework. Around 66% of AI-filtered QAs need further human editing. This reliance on human annotation may impact the scalability and generalizability of the framework.
2. As an evaluation benchmark, the dataset is relatively small and lacks comprehensive coverage. Line 412 mentions that the dataset has not covered all possible skills. Given that the questions are initially generated automatically, it would be beneficial for the data
- The paper is well-structured and easy to follow
- The example generation pipeline can be extended to other structured reasoning domains and could be valuable to the research community
- Experimental results show that the generated examples are more challenging for existing models than the baseline MATH dataset and also served as effective in-context training exemplars
- The description of step 5 in Section 2 lacks clarity. The authors should provide more detail about the human re-validation process. For instance, do the human validators create their own solutions first and then compare them with those generated by the model? Additionally, it would be helpful to know the qualifications of these human experts, as the MATH dataset poses significant challenges, even for college-level students.
I list some minor concerns in the questions section.
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics, Computing, and Information Processing
