Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations
Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger

TL;DR
This paper explores using open-source large language models to simulate student responses for estimating the difficulty of math questions, achieving high correlation with real-world data through role-play and stratification techniques.
Contribution
It introduces a simulation-based approach with LLMs to predict question difficulty, leveraging IRT models and diverse role-play strategies, outperforming direct judgment methods.
Findings
Simulation-based difficulty estimates reach correlations of up to 0.82 with real-world difficulty data.
Diverse student role-plays improve prediction accuracy.
Weaker LLMs can outperform stronger models in difficulty estimation.
Abstract
Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th-grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe…
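To make the pipeline in the abstract concrete, here is a minimal sketch of the last two steps: fitting an IRT model to a matrix of simulated right/wrong outcomes and correlating the learned item difficulties with reference difficulties. Everything specific in it is an assumption for illustration, not the authors' implementation: it uses a one-parameter (Rasch) model, a random synthetic simulator in place of LLM role-play, made-up reference difficulties in place of NAEP statistics, plain gradient ascent with a small L2 penalty, and Pearson correlation as the comparison metric. All function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_responses(abilities, difficulties, rng):
    """Synthetic stand-in for LLM role-play: 1 = correct, 0 = incorrect."""
    return [[1 if rng.random() < sigmoid(a - b) else 0 for b in difficulties]
            for a in abilities]


def fit_rasch(responses, n_iters=300, lr=0.5, l2=0.01):
    """Fit P(correct) = sigmoid(theta_i - b_j) by gradient ascent.

    The small L2 penalty (an assumption of this sketch) keeps estimates
    finite for students who answer every item right or every item wrong.
    """
    n_students, n_items = len(responses), len(responses[0])
    theta = [0.0] * n_students  # student abilities
    b = [0.0] * n_items         # item difficulties
    for _ in range(n_iters):
        g_theta = [0.0] * n_students
        g_b = [0.0] * n_items
        for i in range(n_students):
            for j in range(n_items):
                resid = responses[i][j] - sigmoid(theta[i] - b[j])
                g_theta[i] += resid / n_items
                g_b[j] -= resid / n_students
        theta = [t + lr * (g - l2 * t) for t, g in zip(theta, g_theta)]
        b = [d + lr * (g - l2 * d) for d, g in zip(b, g_b)]
        # Fix the scale's origin: shift so item difficulties average to zero.
        shift = sum(b) / n_items
        b = [d - shift for d in b]
        theta = [t - shift for t in theta]
    return theta, b


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
# Made-up reference difficulties for 10 items, evenly spaced from -2 to 2.
reference_difficulty = [-2.0 + 4.0 * j / 9 for j in range(10)]
# A synthetic "classroom" of 200 students with varying proficiency.
abilities = [rng.gauss(0.0, 1.0) for _ in range(200)]
responses = simulate_responses(abilities, reference_difficulty, rng)

_, learned_difficulty = fit_rasch(responses)
corr = pearson(learned_difficulty, reference_difficulty)
print(f"correlation between learned and reference difficulty: {corr:.2f}")
```

In this toy setup the responses are generated from the same model that is then fitted, so the correlation is expected to be high. With LLM-simulated students and real item statistics there is no such guarantee, which is why the role-play and model choices matter.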
