Evaluating LLM-Generated Q&A Test: a Student-Centered Study

Anna Wróblewska; Bartosz Grabek; Jakub Świstak; Daniel Dan

arXiv:2505.06591 · cs.CL · August 8, 2025

TL;DR

This study develops an AI-based pipeline to generate and evaluate Q&A tests, demonstrating that LLM-generated assessments can match human tests in quality and psychometric properties, supporting scalable AI-assisted assessment creation.

Contribution

Introduces an automated method for creating and assessing Q&A tests using LLMs, validated through psychometric analysis and user ratings, showing comparable quality to human-authored tests.

Findings

01 Generated items show strong discrimination and appropriate difficulty.

02 High user satisfaction with LLM-generated assessments.

03 Two items identified for review due to differential item functioning.

Abstract

This research prepares an automatic pipeline for generating reliable question-answer (Q&A) tests using AI chatbots. We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
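
To make "discrimination" and "difficulty" concrete, the sketch below evaluates the standard two-parameter logistic (2PL) item response function, in which a student of ability theta answers an item correctly with probability 1 / (1 + exp(-a (theta - b))). This is only an illustration: the abstract describes a mixed-format IRT analysis, and the 2PL form would cover at most the dichotomous items in such a model. The function name `irt_2pl`, the item labels, and all parameter values are hypothetical and are not taken from the paper.

```python
import math


def irt_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct answer under the 2PL model.

    theta: student ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters (a, b), chosen only for illustration.
items = {
    "weakly_discriminating": (0.5, 0.0),
    "strongly_discriminating": (2.0, 0.0),
}

for name, (a, b) in items.items():
    # Compare a below-average, average, and above-average student.
    probs = [irt_2pl(theta, a, b) for theta in (-1.0, 0.0, 1.0)]
    print(name, [round(p, 3) for p in probs])
```

At theta equal to b the probability is exactly 0.5 for any item, so b locates the item on the ability scale, while a larger a makes the probability rise more steeply around that point and therefore separates weaker from stronger students more sharply.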
