RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning
Xinyuan Li, Murong Xu, Wenbiao Tao, Hanlun Zhu, Yike Zhao, Jipeng Zhang, Yunshi Lan

TL;DR
RIDE is a novel framework that uses Item Response Theory and large language models to generate challenging, well-posed mathematical questions for more accurate evaluation of LLM reasoning abilities.
Contribution
It introduces an adversarial question-rewriting method guided by IRT and LLMs to systematically evaluate and challenge mathematical reasoning in LLMs.
Findings
Achieves an average 21.73% performance drop across 26 models.
Generates well-posed, more difficult mathematical questions.
Exposes limitations in current LLM reasoning robustness.
Abstract
Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during…
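As a rough illustration of how IRT turns a matrix of student responses into per-question difficulty scores, here is a minimal sketch of the Rasch (one-parameter) model, in which the probability of a correct answer is a sigmoid of ability minus difficulty. This is not the authors' implementation: the reviews describe variational inference on the Rasch model, whereas this sketch substitutes plain joint maximum likelihood via gradient ascent. The names `rasch_prob` and `fit_rasch`, the learning rate, the step count, and the toy response matrix are all assumptions made for illustration.

```python
import numpy as np


def rasch_prob(theta, b):
    """P(correct) under the Rasch model: sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))


def fit_rasch(responses, lr=0.02, steps=3000):
    """Estimate abilities (theta) and difficulties (b) by gradient ascent.

    responses: (n_students, n_items) array of 0/1 outcomes.
    NOTE: illustrative joint maximum likelihood, not variational inference.
    """
    responses = np.asarray(responses, dtype=float)
    n_students, n_items = responses.shape
    theta = np.zeros(n_students)
    b = np.zeros(n_items)
    for _ in range(steps):
        p = rasch_prob(theta[:, None], b[None, :])
        resid = responses - p            # gradient of log-likelihood w.r.t. (theta - b)
        theta += lr * resid.sum(axis=1)  # ability rises with more correct answers
        b -= lr * resid.sum(axis=0)      # difficulty rises with fewer correct answers
        # The model is invariant to a common shift, so centre difficulties at zero.
        shift = b.mean()
        b -= shift
        theta -= shift
    return theta, b


# Toy data (hypothetical): 4 "students" x 3 questions.
# Question 0 is solved by 3 students, question 1 by 2, question 2 by 1.
responses = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])
theta, b = fit_rasch(responses)
print("difficulty:", np.round(b, 2))
print("ability:   ", np.round(theta, 2))
```

In this toy setup the question solved by the fewest students receives the largest difficulty estimate, which is the kind of ordering a difficulty ranker could then be trained to reproduce.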
Peer Reviews
Decision · Submitted to ICLR 2026
1. The methodology is rigorous and well-motivated. Difficulty estimation is carefully implemented via variational inference on the Rasch model with data augmentation (VAE + sampling) to stabilize parameter estimation. 2. The pairwise ranker formulation mitigates regression instability in symbolic domains and yields interpretable difficulty scores. Experimental coverage is extensive: 23 LLMs spanning 0.6 B–1 T parameters, multiple families (Qwen, LLaMA, DeepSeek, GPT, Gemini, etc.), and both open
1. The “student” LLM ability parameters (θ) are estimated but not analyzed—e.g., how they correlate with model size or reasoning specialization. 2. The difficulty ranker relies on text embeddings; semantic fidelity is measured indirectly. Cases where numerical tweaks superficially raise difficulty but not reasoning depth aren’t deeply analyzed. 3. The paper compares only to rule-based perturbations (e.g., GSM-Plus). It omits baselines like adversarial rewriting via contrastive prompting or reas
1. The paper's primary strength is its innovative use of Item Response Theory to formalize and quantify the concept of "question difficulty." The RIDE framework provides a more systematic and data-driven way to evolve difficulty. 2. The overall technical pipeline is well-conceived and executed. Key design choices are commendable. 3. The authors test a wide range of 26 state-of-the-art proprietary and open-source models, providing strong evidence for their claims. 4. The paper is exceptio
1. The entire framework hinges on the IRT difficulty estimates, which are derived from the performance of a specific cohort of 35 LLMs. This raises a question: does the framework measure intrinsic mathematical difficulty, or does it measure "difficulty-for-LLMs"? The generated questions might be overfitting to exploit common failure modes of current LLMs rather than becoming more difficult in a way that would also challenge a human. 2. The RL training process relies heavily on GPT-5-mini
1. The paper presents its methodology and experiments in detail. A comprehensive data augmentation pipeline is designed, covering reward model training and rewriter training. 2. The paper introduces a simple model (IRT) that effectively quantifies the difficulty of the generated problems.
1. Given the limited LLM resources, the paper adopts multiple augmentation methods to extend the problem-response data obtained from LLM responses, including a VAE/sampling augmentation and the training of a pairwise difficulty ranker. However, as both methods bootstrap from existing responses, it is unclear whether this approach exacerbates overfitting in reward modeling. It is also unclear why it is necessary to augment the response matrix using VAE method or sampling me
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Advanced Graph Neural Networks
