UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions
Ana-Cristina Rogoz, Radu Tudor Ionescu

TL;DR
This paper presents a novel data augmentation approach using Large Language Models to predict item difficulty and response time for medical exam questions, demonstrating potential improvements in automated assessment systems.
Contribution
Introduces a new LLM-based data augmentation method for predicting question difficulty and response time in medical exams, with analysis of feature combinations and model performance.
Findings
Predicting question difficulty remains challenging.
Including question text improves prediction accuracy.
LLM answer variability enhances model performance.
Abstract
This work explores a novel data augmentation method based on Large Language Models (LLMs) for predicting item difficulty and response time of retired USMLE Multiple-Choice Questions (MCQs) in the BEA 2024 Shared Task. Our approach augments the dataset with answers from zero-shot LLMs (Falcon, Meditron, Mistral) and employs transformer-based models built on six alternative feature combinations. The results suggest that predicting the difficulty of questions is the more challenging of the two tasks. Notably, our top performing methods consistently include the question text, and benefit from the variability of LLM answers, highlighting the potential of LLMs for improving automated assessment in medical licensing exams. We make our code available at https://github.com/ana-rogoz/BEA-2024.
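To illustrate the augmentation idea described in the abstract, the following is a minimal sketch of how an exam item might be combined with zero-shot LLM answers into a single text input for a transformer-based model. It is not the authors' implementation: the `Item` class, the `build_input` function, the three feature flags, and the `[SEP]` separator are all hypothetical names chosen for illustration, and the six feature combinations used in the paper are not listed here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Item:
    """A multiple-choice item plus answers collected from LLMs (hypothetical schema)."""
    question: str
    options: Dict[str, str]
    llm_answers: Dict[str, str] = field(default_factory=dict)


def build_input(item: Item,
                use_question: bool = True,
                use_options: bool = True,
                use_llm_answers: bool = True,
                sep: str = " [SEP] ") -> str:
    """Concatenate the selected features into one string.

    Each flag switches one feature group on or off, so different
    flag settings give different feature combinations (an assumption
    made for this sketch, not the paper's exact set).
    """
    parts = []
    if use_question:
        parts.append(item.question)
    if use_options:
        # Sort option keys so the output is deterministic.
        parts.append(" ".join(f"({k}) {v}" for k, v in sorted(item.options.items())))
    if use_llm_answers:
        # One segment per LLM, sorted by model name.
        parts.extend(f"{name}: {ans}" for name, ans in sorted(item.llm_answers.items()))
    return sep.join(parts)
```

The resulting string could then be tokenized and passed to a regression model that predicts difficulty or response time. Keeping each feature group behind its own flag makes it straightforward to compare inputs with and without the question text or the LLM answers.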
Taxonomy
Topics: Educational Technology and Assessment
