Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?
Leonidas Zotos, Hedderik van Rijn, Malvina Nissim

TL;DR
This paper investigates whether the uncertainty of large generative models can serve as a proxy for estimating the difficulty of multiple-choice questions, revealing weak correlations and differences based on answer correctness and question types.
Contribution
It examines how model uncertainty correlates with actual student response distributions, using a fine-grained, previously unused dataset, and analyzes how uncertainty varies with question type and answer correctness.
Findings
Weak correlation between model uncertainty and question difficulty
Model behavior differs for correct and wrong answers
Correlation varies across different question types
Abstract
Estimating the difficulty of multiple-choice questions would be of great help to educators, who must spend substantial time creating and piloting stimuli for their tests, and to learners who want to practice. Supervised approaches to difficulty estimation have yielded mixed results to date. In this contribution we leverage an aspect of generative large models which might be seen as a weakness when answering questions, namely their uncertainty, and use it to explore correlations between two different metrics of uncertainty and the actual student response distribution. While we observe some present but weak correlations, we also discover that the models' behaviour is different in the case of correct vs wrong answers, and that correlations differ substantially according to the different question types which are included in our fine-grained, previously unused dataset of 451…
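The abstract does not name its two uncertainty metrics, so the sketch below is only an illustration of the general idea, not the authors' method. It assumes one plausible metric (Shannon entropy over the model's answer-choice probabilities) and defines item difficulty as the share of students who answered incorrectly. The helper names `choice_entropy` and `pearson`, and all numbers in `items`, are made up for this example. The toy data is deliberately clean and says nothing about the strength of the correlations reported in the paper.

```python
import math


def choice_entropy(probs):
    """Shannon entropy (in bits) of a distribution over answer choices.

    Assumption: entropy stands in for "model uncertainty"; the paper's
    actual metrics are not specified in the abstract.
    """
    total = sum(probs)
    normalized = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in normalized if p > 0)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical items: model probabilities per choice and the fraction of
# students who picked the correct answer. All values are invented.
items = [
    {"model_probs": [0.97, 0.01, 0.01, 0.01], "student_correct_rate": 0.92},
    {"model_probs": [0.60, 0.30, 0.05, 0.05], "student_correct_rate": 0.70},
    {"model_probs": [0.40, 0.35, 0.15, 0.10], "student_correct_rate": 0.55},
    {"model_probs": [0.25, 0.25, 0.25, 0.25], "student_correct_rate": 0.35},
]

uncertainties = [choice_entropy(item["model_probs"]) for item in items]
difficulties = [1.0 - item["student_correct_rate"] for item in items]

r = pearson(uncertainties, difficulties)
print(f"Pearson r between entropy and difficulty: {r:.3f}")
```

With real data, the same computation could be repeated per question type, or separately for items the model answers correctly and incorrectly, to mirror the breakdowns the abstract mentions.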
Taxonomy
Topics: Multi-Criteria Decision Making
