Using Vision + Language Models to Predict Item Difficulty
Samin Khan

TL;DR
This study shows that large language models can predict item difficulty in data visualization literacy tests, and that a multimodal approach combining item text and the visualization image outperforms unimodal (text-only or vision-only) models.
Contribution
The paper introduces a multimodal LLM-based method for predicting item difficulty, showing improved accuracy over unimodal models in psychometric analysis.
Findings
Multimodal models achieved the lowest MAE of 0.224.
Text-only and vision-only models had higher errors of 0.338 and 0.282 respectively.
Applied to a held-out test set, the best-performing multimodal model achieved a mean squared error of 0.10805.
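To compare the three reported MAE values on a common scale, the relative error reduction of the multimodal model can be worked out from the numbers above. This is illustrative arithmetic only, not a statistic reported by the paper; the helper name `relative_reduction` is an assumption introduced here.

```python
# MAE values as reported in the findings above.
MAE_MULTIMODAL = 0.224
MAE_VISION_ONLY = 0.282
MAE_TEXT_ONLY = 0.338


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline


# Hypothetical helper applied to the reported figures.
vs_text = relative_reduction(MAE_TEXT_ONLY, MAE_MULTIMODAL)
vs_vision = relative_reduction(MAE_VISION_ONLY, MAE_MULTIMODAL)

print(f"Reduction vs. text-only:   {vs_text:.1%}")
print(f"Reduction vs. vision-only: {vs_vision:.1%}")
```

Under this calculation the multimodal MAE is roughly a third lower than the text-only MAE and roughly a fifth lower than the vision-only MAE.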
Abstract
This project investigates the capabilities of large language models (LLMs) to determine the difficulty of data visualization literacy test items. We explore whether features derived from item text (question and answer options), the visualization image, or a combination of both can predict item difficulty (proportion of correct responses) for U.S. adults. We use GPT-4.1-nano to analyze items and generate predictions based on these distinct feature sets. The multimodal approach, using both visual and text features, yields the lowest mean absolute error (MAE) (0.224), outperforming the unimodal vision-only (0.282) and text-only (0.338) approaches. The best-performing multimodal model was applied to a held-out test set for external evaluation and achieved a mean squared error of 0.10805, demonstrating the potential of LLMs for psychometric analysis and automated item development.
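The two metrics named in the abstract, mean absolute error and mean squared error, compare each item's predicted difficulty with its observed proportion of correct responses. The following is a minimal sketch of both, assuming difficulties are given as plain lists of proportions in [0, 1]. The function names and the three sample items are hypothetical and are not taken from the paper's data or code.

```python
from typing import Sequence


def _check(observed: Sequence[float], predicted: Sequence[float]) -> None:
    """Validate that both sequences are non-empty and aligned item by item."""
    if len(observed) == 0 or len(observed) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")


def mean_absolute_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute gap between observed and predicted proportion correct."""
    _check(observed, predicted)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_squared_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Average squared gap; penalizes large misses more heavily than MAE."""
    _check(observed, predicted)
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical example: three items with made-up difficulties (not paper data).
observed_difficulty = [0.80, 0.50, 0.30]   # proportion of respondents answering correctly
predicted_difficulty = [0.70, 0.60, 0.50]  # model's predicted proportion correct

mae = mean_absolute_error(observed_difficulty, predicted_difficulty)
mse = mean_squared_error(observed_difficulty, predicted_difficulty)
print(f"MAE = {mae:.3f}, MSE = {mse:.3f}")
```

Because the abstract reports MAE for the model comparison and MSE for the held-out evaluation, the two figures are on different scales and should not be compared directly.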
Taxonomy
Topics: Psychometric Methodologies and Testing · Data Visualization and Analytics · Text Readability and Simplification
