Estimating Item Difficulty with Large Language Models as Experts

Diana Kolesnikova (1), Kirill Fedyanin (2), Abe D. Hofman (3, 4), Matthieu J. S. Brinkhuis (5), Maria Bolsinova (1) ((1) Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands; (2) Smart Business Technologies, Belgrade, Serbia; (3) Department of Psychological Methods, University of Amsterdam, Amsterdam, Netherlands; (4) Prowise Learn, Amsterdam, Netherlands; (5) Department of Information and Computing Sciences, Utrecht University, Utrecht, Netherlands)

arXiv:2605.18562·stat.ME·May 19, 2026


TL;DR

This study evaluates the effectiveness of large language models as expert-like difficulty raters for new assessment items across multiple domains, comparing different elicitation procedures and prompting strategies.

Contribution

The study provides empirical evidence that LLMs can reliably estimate item difficulty without response data, and it identifies the best-performing configurations for initial item calibration.

Findings

1. LLMs showed moderate to strong correlation with empirical difficulty estimates.
2. Pairwise judgment outperformed absolute judgment without refinements.
3. Token-based prompting with examples improved absolute judgment accuracy.
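To make the comparison between elicitation procedures concrete, the following is a minimal sketch of how absolute ratings and pairwise judgments could each be scored against empirical difficulties with a rank correlation. It is not the authors' procedure: the win-rate aggregation, the function names, and all numbers are assumptions chosen for illustration, and the toy data say nothing about the paper's actual results.

```python
# Illustrative sketch only: toy data and helper names are hypothetical,
# and win-rate aggregation is an assumed way to turn pairwise
# judgments into item scores (the source does not specify one).

def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def win_rate_scores(n_items, judgments):
    """Score each item by the share of its comparisons in which it
    was judged the harder one. judgments holds (i, j, harder) triples."""
    wins = [0] * n_items
    counts = [0] * n_items
    for i, j, harder in judgments:
        counts[i] += 1
        counts[j] += 1
        wins[harder] += 1
    return [w / c if c else 0.0 for w, c in zip(wins, counts)]

# Hypothetical empirical difficulties for four items (higher = harder).
empirical = [-1.2, -0.3, 0.4, 1.5]

# Hypothetical absolute ratings from a model on a 1-5 scale.
absolute_ratings = [2, 3, 2, 5]

# Hypothetical pairwise judgments: (item_a, item_b, item judged harder).
pairwise_judgments = [
    (0, 1, 1), (0, 2, 2), (0, 3, 3),
    (1, 2, 1),  # one judgment that disagrees with the empirical order
    (1, 3, 3), (2, 3, 3),
]
pairwise_scores = win_rate_scores(len(empirical), pairwise_judgments)

rho_absolute = spearman(absolute_ratings, empirical)
rho_pairwise = spearman(pairwise_scores, empirical)
print(f"absolute: {rho_absolute:.3f}, pairwise: {rho_pairwise:.3f}")
```

A rank correlation is used here because ratings, win rates, and empirical difficulty estimates live on different scales; a model-based aggregation such as Bradley-Terry would be a natural alternative to raw win rates when not every pair is compared.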

Abstract

Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined six domains of primary-school mathematics, with empirical difficulty estimates serving as the reference. The study used a full factorial design crossing…
