# Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties

**Authors:** Philipp Linde, Florian Fichter, Markus Dietlein, Ferdinand Sudbrock, Kambiz Afshar, Hendrik Dapper, Emmanouil Fokas, Anna-Lena Hillebrecht, Tobias Raupach, Matthias Carl Laupichler

PMC · DOI: 10.1038/s41746-025-02313-7 · NPJ Digital Medicine · 2026-01-08

## TL;DR

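This study compares GPT-4o-generated and human-written multiple-choice questions in radiation oncology, radiology, and nuclear medicine. Item difficulty and discrimination did not differ significantly between the two sets, and examinees could not identify item origin above chance.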

## Contribution

In an expert-reviewed, human-in-the-loop workflow, GPT-4o-generated questions did not differ significantly from human-authored ones in item difficulty or discriminatory power, and examinees did not identify their origin above chance.

## Key findings

- Item difficulty (human: mean 0.65 [SD 0.22] vs LLM: 0.67 [0.20]) and discrimination (0.27 [0.12] vs 0.29 [0.12]) did not differ significantly between human-authored and GPT-4o-generated questions.
- Examinees (82 medical students and 46 physicians) did not identify item origin above chance (0.50).
- Expert ratings of appropriateness and didactic quality showed low interrater agreement (ICC = 0.07–0.18), suggesting that such ratings are subjective.

## Abstract

Large language models (LLMs) have the potential to scale assessment in medical education, but their psychometric equivalence to expert-written items and the detectability of their origin remain uncertain. In a preregistered, single-center, blinded observational, within-subject comparison, we evaluated 24 GPT-4o–generated versus 24 human-authored topic-matched multiple-choice questions (MCQs) across radiation oncology, radiology, and nuclear medicine. Medical students (n = 82) and physicians (n = 46) completed an identical 48-item formative mock examination, with item origin masked. Item difficulty (human: mean 0.65 [SD 0.22] vs LLM: 0.67 [0.20]) and discrimination (0.27 [0.12] vs 0.29 [0.12]) did not differ significantly; participants did not identify item origin above chance (0.50). Expert ratings of appropriateness and didactic quality showed low interrater agreement (ICC = 0.07–0.18). In this expert-reviewed, human-in-the-loop workflow, the item difficulty and discriminatory power of MCQs generated with GPT-4o did not differ significantly from those of expert-authored items, and were not reliably recognized as AI-generated by examinees. These findings delineate a feasible pathway for responsibly scaling formative assessment content in imaging-focused medical education, while underscoring the need for explicit educational policies regarding oversight, transparency, and fairness.
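For readers unfamiliar with the two item statistics reported above, the following is a minimal sketch of how they are commonly computed in classical test theory. It is an illustration, not the authors' analysis code. Difficulty is taken as the proportion of examinees answering an item correctly. The abstract does not say which discrimination index was used, so the corrected item-total correlation here is an assumption. The response matrix and all function names are hypothetical.

```python
from statistics import mean, pstdev


def item_difficulty(responses):
    """Proportion of examinees answering each item correctly (0 = hard, 1 = easy)."""
    n_items = len(responses[0])
    return [mean(row[j] for row in responses) for j in range(n_items)]


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has zero variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_discrimination(responses):
    """Corrected item-total correlation (assumed index, not confirmed by the paper).

    Each item's 0/1 scores are correlated with the total score on all
    *other* items, so the item does not inflate its own correlation.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [t - i for t, i in zip(totals, item)]
        result.append(pearson(item, rest))
    return result


# Hypothetical toy data: rows are examinees, columns are items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

difficulty = item_difficulty(responses)
discrimination = item_discrimination(responses)

for j, (p, r) in enumerate(zip(difficulty, discrimination), start=1):
    print(f"item {j}: difficulty={p:.2f}, discrimination={r:.2f}")
```

In a comparison like the one described, these per-item values would be computed for all 48 items and then summarized separately for the human-authored and GPT-4o-generated sets.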

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12881591/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12881591/full.md

## References

5 references — full list in the complete paper: https://tomesphere.com/paper/PMC12881591/full.md

---
Source: https://tomesphere.com/paper/PMC12881591