Estimating Item Difficulty with Large Language Models as Experts
Diana Kolesnikova (1), Kirill Fedyanin (2), Abe D. Hofman (3, 4), Matthieu J. S. Brinkhuis (5), Maria Bolsinova (1) ((1) Department of Methodology, Statistics, Tilburg University, Tilburg, Netherlands, (2) Smart Business Technologies, Belgrade, Serbia

TL;DR
This study evaluates the effectiveness of large language models as expert-like difficulty raters for new assessment items across multiple domains, comparing different elicitation procedures and prompting strategies.
Contribution
It provides empirical evidence that LLMs can reliably estimate item difficulty without response data, highlighting optimal configurations for initial item calibration.
Findings
LLMs showed moderate to strong correlation with empirical difficulty estimates.
Pairwise judgment outperformed absolute judgment without refinements.
Token-based prompting with examples improved absolute judgment accuracy.
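As a rough illustration of how such judgments could be scored, the sketch below turns token log-probabilities over rating labels into an expected rating, turns pairwise "which item is harder" judgments into win-rate scores, and compares either set of scores with empirical difficulties using a rank correlation. This is a sketch under stated assumptions, not the study's procedure: the function names, the rating labels, the win-rate scoring rule, and the choice of Spearman correlation are all hypothetical, since the summary does not specify them.

```python
import math


def expected_rating(token_logprobs):
    """Probability-weighted mean rating from log-probabilities over rating tokens.

    token_logprobs maps a rating label (assumed here to be a digit string such
    as "1".."5") to the log-probability the model assigned to that token.
    """
    probs = {int(label): math.exp(lp) for label, lp in token_logprobs.items()}
    total = sum(probs.values())  # renormalise over the rating tokens only
    return sum(rating * p for rating, p in probs.items()) / total


def win_rate_scores(items, harder_pairs):
    """Score each item by the share of its comparisons in which it was judged harder.

    harder_pairs is a list of (harder_item, easier_item) tuples (assumed format).
    Items that never appear in a comparison get a neutral score of 0.5.
    """
    wins = {item: 0 for item in items}
    seen = {item: 0 for item in items}
    for harder, easier in harder_pairs:
        wins[harder] += 1
        seen[harder] += 1
        seen[easier] += 1
    return {item: (wins[item] / seen[item] if seen[item] else 0.5) for item in items}


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In use, one would collect a score per item from either function and pass the scores, together with the empirical difficulty estimates for the same items in the same order, to `spearman`. Note that `spearman` is undefined (division by zero) when all values in either input are identical.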
Abstract
Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined six domains of primary-school mathematics, with empirical difficulty estimates serving as the reference. The study used a full factorial design crossing…
