Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository
Radhika Kapoor, Sang T. Truong, Nick Haber, Maria Araceli Ruiz-Primo, and Benjamin W. Domingue

TL;DR
This paper develops a penalized regression model that uses linguistic, test, and context features, along with LLM embeddings, to predict the difficulty of reading comprehension items from their text, aiding educational assessment design.
Contribution
It introduces a novel annotated item repository and demonstrates that combined features and embeddings can effectively predict item difficulty, with potential for public use.
Findings
The penalized regression model achieves an RMSE of 0.59, compared with a baseline RMSE of 0.92.
The correlation between true and predicted difficulty is 0.77.
Using only linguistic features or LLM embeddings yields similar prediction performance.
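The two metrics reported above can be computed as in the minimal sketch below. It assumes the baseline predicts the mean training difficulty for every held-out item; the summary does not say how the baseline was defined, so that choice and the function names `rmse`, `baseline_rmse`, and `pearson_r` are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted difficulties."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def baseline_rmse(y_train, y_test):
    """RMSE of a constant predictor (ASSUMPTION: the training-set mean)."""
    mean_train = sum(y_train) / len(y_train)
    return rmse(y_test, [mean_train] * len(y_test))

def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted difficulties."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

A model is useful on this scale only if its RMSE falls clearly below the constant-predictor RMSE, which is the comparison the 0.59 versus 0.92 figures express.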
Abstract
Prediction of item difficulty based on its text content is of substantial interest. In this paper, we focus on the related problem of recovering IRT-based difficulty when the data originally reported item p-value (percent correct responses). We model this item difficulty using a repository of reading passages and student data from US standardized tests from New York and Texas for grades 3-8 spanning the years 2018-23. This repository is annotated with metadata on (1) linguistic features of the reading items, (2) test features of the passage, and (3) context features. A penalized regression prediction model with all these features can predict item difficulty with RMSE 0.59 compared to baseline RMSE of 0.92, and with a correlation of 0.77 between true and predicted difficulty. We supplement these features with embeddings from LLMs (ModernBERT, BERT, and LLaMA), which marginally improve…
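As a rough illustration of the pipeline the abstract describes, the sketch below maps item p-values to a logit-scale difficulty and fits a ridge (L2-penalized) regression on item features. Everything here is an assumption made for illustration: the negative-logit transform is a common approximation and not necessarily the paper's IRT recovery procedure, ridge stands in for the unspecified penalty, and the features and coefficients are synthetic rather than drawn from the repository.

```python
import numpy as np

def pvalue_to_logit_difficulty(p, eps=1e-6):
    """ASSUMPTION: approximate difficulty as -logit(p); higher means harder."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.log(p / (1 - p))

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the feature weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def predict(X, w, b):
    """Predicted difficulty for each row of the feature matrix X."""
    return np.asarray(X, dtype=float) @ w + b

# Synthetic data only: 200 hypothetical items with 5 hypothetical features.
rng = np.random.default_rng(0)
beta = np.array([0.8, -0.4, 0.0, 0.3, 0.0])
X = rng.normal(size=(200, 5))
p = 1 / (1 + np.exp(-(0.5 - X @ beta)))  # simulated proportion correct
y = pvalue_to_logit_difficulty(p)

w, b = fit_ridge(X[:150], y[:150], alpha=1.0)  # fit on the first 150 items
y_hat = predict(X[150:], w, b)                 # predict the held-out 50
```

In practice the penalty strength `alpha` would be chosen by cross-validation, and embedding dimensions could be appended as extra columns of `X`.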
