Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots
Licol Zeinfeld, Alona Strugatski, Ziva Bar-Dov, Ron Blonder, Shelley Rap, Giora Alexandron

TL;DR
This paper introduces a statistically grounded method, based on Differential Item Functioning (DIF) analysis, for identifying assessment items on which humans and chatbots respond differently, supporting the design of fair and valid assessments in the AI era.
Contribution
The method combines educational data mining and psychometric theory to systematically detect items vulnerable to AI misuse and to characterize the task features that influence chatbot performance.
Findings
DIF analysis effectively identifies items with systematic human-chatbot response differences.
The method reveals task dimensions that make problems easier or harder for AI.
DIF-informed analytics help clarify where human and AI capabilities diverge.
Abstract
The rapid adoption of large language models (LLMs) in education raises profound challenges for assessment design. To adapt assessments to the presence of LLM-based tools, it is crucial to characterize the strengths and weaknesses of LLMs in a generalizable, valid and reliable manner. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and little research applies theory-grounded measurement methods to characterize LLM capabilities relative to human learners in ways that directly support assessment design. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences, pinpointing where assessments may be most vulnerable to AI misuse, and which task dimensions make problems particularly easy or difficult for…
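The following is a minimal sketch of an item-level DIF check, illustrating the general idea rather than the authors' implementation. This summary does not say which DIF procedure the paper uses, so the Mantel-Haenszel procedure, one standard DIF method, is used here instead. Respondents are stratified by a matching score, and within each stratum the odds of answering the studied item correctly are compared between a reference group (humans) and a focal group (chatbots). Several details are hypothetical and chosen only for illustration:

- the `(group, matching_score, correct)` record format,
- the group labels `"human"` and `"llm"`, and
- treating each chatbot run as one respondent with its own total score.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(records, reference="human", focal="llm"):
    """Mantel-Haenszel DIF statistics for a single studied item.

    records: iterable of (group, matching_score, correct) tuples, where
    `correct` is 1/0 on the studied item and `matching_score` is the
    stratifying variable (e.g. total test score). This input format and the
    default group labels are hypothetical, chosen only for illustration;
    treating each chatbot run as a respondent is likewise an assumption.
    Returns (alpha_MH, MH D-DIF, MH chi-square with continuity correction).
    """
    # Per-stratum 2x2 counts [A, B, C, D] =
    # [reference correct, reference incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        if group == reference:
            tables[score][0 if correct else 1] += 1
        elif group == focal:
            tables[score][2 if correct else 3] += 1

    num = den = 0.0                   # pieces of the common odds ratio
    sum_a = sum_ea = sum_var = 0.0    # pieces of the MH chi-square
    for a, b, c, d in tables.values():
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with one respondent carries no information
        num += a * d / n
        den += b * c / n
        n_ref, n_foc = a + b, c + d   # group sizes in this stratum
        m1, m0 = a + c, b + d         # correct / incorrect totals in this stratum
        sum_a += a
        sum_ea += n_ref * m1 / n      # expected A under no DIF
        sum_var += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))

    if num == 0 or den == 0:
        raise ValueError("common odds ratio undefined for these strata")
    alpha = num / den                 # > 1 means the item favors the reference group
    delta = -2.35 * math.log(alpha)   # ETS delta scale (MH D-DIF)
    chi2 = ((abs(sum_a - sum_ea) - 0.5) ** 2 / sum_var
            if sum_var > 0 else float("nan"))
    return alpha, delta, chi2


if __name__ == "__main__":
    # Toy data: one score stratum in which humans outperform chatbot runs on the item.
    toy = ([("human", 1, 1)] * 3 + [("human", 1, 0)]
           + [("llm", 1, 1)] + [("llm", 1, 0)] * 3)
    alpha, delta, chi2 = mantel_haenszel_dif(toy)
    print(f"alpha_MH={alpha:.2f}  MH D-DIF={delta:.2f}  chi2={chi2:.3f}")
```

Matching on a score is what separates DIF from a plain difference in proportion correct. The question is not whether chatbots score lower on an item overall. It is whether they do worse than humans *of comparable overall score*. Interpreting the results:

- **Sign.** With humans as the reference group, a negative MH D-DIF indicates an item that is relatively harder for the chatbot group. A positive value indicates an item that is relatively easier for the chatbot group.
- **Magnitude.** Under the usual ETS convention, |MH D-DIF| < 1 is treated as negligible. Values of 1.5 or more, when also statistically significant, are treated as large.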
