Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study

Calvin Isley; Joshua Gilbert; Evangelos Kassos; Michaela Kocher; Allen Nie; Emma Brunskill; Ben Domingue; Jake Hofman; Joscha Legewie; Teddy Svoronos; Charlotte Tuminelli; Sharad Goel

arXiv:2508.08314 · cs.CY · August 13, 2025

TL;DR

This large-scale field study evaluates the quality of AI-generated exam questions across diverse courses and finds that they perform comparably to expert-written questions, demonstrating AI's potential to enhance assessment creation.

Contribution

Introduces and assesses an iterative AI question refinement method in a large educational field study, demonstrating AI's capability to produce high-quality exam questions.

Findings

01. AI-generated questions perform comparably to expert-written questions in item response theory (IRT) analysis.
02. Iterative refinement improves question quality through cycles of critique and revision.
03. AI can generate high-quality assessments at scale.
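
To make the IRT comparison concrete, here is a minimal sketch of how a single item is scored under the two-parameter logistic (2PL) model. The choice of 2PL is an assumption for illustration: this summary does not say which IRT model the study fit, and the function and parameter names are hypothetical.

```python
import math


def prob_correct(theta: float, discrimination: float, difficulty: float) -> float:
    """2PL model: probability that a student of ability `theta` answers correctly.

    P = 1 / (1 + exp(-a * (theta - b))), with a = discrimination, b = difficulty.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def item_information(theta: float, discrimination: float, difficulty: float) -> float:
    """Fisher information of a 2PL item at ability `theta`: a^2 * P * (1 - P)."""
    p = prob_correct(theta, discrimination, difficulty)
    return discrimination ** 2 * p * (1.0 - p)
```

Under this model, an item with higher discrimination separates students near its difficulty more sharply, which is one way to put AI-generated and expert-written items on a common scale.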

Abstract

While large language models (LLMs) challenge conventional methods of teaching and learning, they present an exciting opportunity to improve efficiency and scale high-quality instruction. One promising application is the generation of customized exams, tailored to specific course content. There has been significant recent excitement about automatically generating questions using artificial intelligence, but comparatively little work evaluating the psychometric quality of these items in real-world educational settings. Filling this gap is an important step toward understanding generative AI's role in effective test design. In this study, we introduce and evaluate an iterative refinement strategy for question generation, repeatedly producing, assessing, and improving questions through cycles of LLM-generated critique and revision. We evaluate the quality of these AI-generated questions…
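
The generate, critique, and revise cycle described in the abstract could be sketched roughly as follows. This is not the authors' implementation: the prompts, the `OK` stopping convention, the round limit, and every name here (`LLM`, `RefinementResult`, `refine_question`) are assumptions made for illustration. The `llm` argument stands in for any function that maps a prompt string to a model reply.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any callable that takes a prompt and returns text.
LLM = Callable[[str], str]


@dataclass
class RefinementResult:
    question: str      # final version of the question
    rounds_used: int   # number of revisions actually applied


def refine_question(llm: LLM, course_material: str, max_rounds: int = 3) -> RefinementResult:
    """Draft one question, then alternate LLM critique and revision."""
    question = llm(f"Write one exam question based on this material:\n{course_material}")
    for round_index in range(max_rounds):
        critique = llm(
            "Critique this exam question. Reply with exactly 'OK' "
            f"if it needs no changes.\n{question}"
        )
        # Assumed stopping rule: the critic signals that no revision is needed.
        if critique.strip() == "OK":
            return RefinementResult(question, round_index)
        question = llm(
            "Revise the question to address the critique.\n"
            f"Question: {question}\nCritique: {critique}"
        )
    # Round limit reached; return the latest revision.
    return RefinementResult(question, max_rounds)
```

Passing the model in as a plain callable keeps the loop independent of any particular LLM provider and makes it easy to exercise with a scripted stand-in.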

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Student Assessment and Feedback · Psychometric Methodologies and Testing