# Assessing the Utility of AI Versus Human-Created MCQs in Pediatric Medical Education

**Authors:** James Knight, Richard G. McGee, Bunmi S. Malau-Aduli

PMC · DOI: 10.1177/23821205261427885 · Journal of Medical Education and Curricular Development · 2026-03-03

## TL;DR

This study compares the psychometric quality of AI-generated and human-authored multiple-choice questions (MCQs) in pediatric medical education and finds that human-authored questions currently perform better on every quality indicator measured.

## Contribution

The study provides a direct comparison of AI-generated and human-authored MCQs in pediatrics using classical test theory metrics.

## Key findings

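- Human-authored questions discriminated better than AI-generated ones (mean discrimination index 0.29 vs. 0.19), and fewer fell outside the acceptable difficulty range (32% vs. 56%).
- AI-generated questions had more nonfunctioning distractors and less consistent psychometric quality, although some AI items met ideal thresholds.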
- AI shows potential as a supplementary tool in hybrid workflows with human oversight.

## Abstract

Multiple-choice questions (MCQs) remain central to assessment in medical education, but their development is resource intensive. Generative artificial intelligence (AI) offers a potential solution by automating MCQ creation. However, little is known about the psychometric quality of AI-generated MCQs compared with human-authored items, particularly in pediatric education.

This study aimed to directly compare the quality of AI- and human-generated MCQs in pediatrics using item analysis grounded in classical test theory.

A formative exam comprising both AI-generated (Microsoft Copilot) and human-authored pediatric MCQs was administered to fourth-year medical students. Item analysis was performed to calculate difficulty indices, discrimination indices, item-total correlations, and distractor functioning. Reliability was assessed using the Kuder-Richardson Formula 20 (KR-20). Descriptive and inferential statistics, including paired t-tests, compared performance between AI and human items.
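For readers unfamiliar with classical test theory, the sketch below shows one common way to compute three of the metrics named above from a matrix of 0/1 item scores. It is illustrative only: the function names, the 27% upper/lower grouping, and the use of population variance in KR-20 are assumptions chosen for the example, not details taken from the paper.

```python
from statistics import mean, pvariance


def difficulty_index(item_scores):
    """Proportion of examinees who answered the item correctly (0 to 1)."""
    return mean(item_scores)


def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Upper-lower discrimination: p(correct | top group) - p(correct | bottom group).

    The 27% group size is a widely used convention, assumed here.
    """
    n = len(item_scores)
    k = max(1, round(n * fraction))
    # Rank examinees by total exam score, lowest first.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    return mean(item_scores[i] for i in upper) - mean(item_scores[i] for i in lower)


def kr20(responses):
    """KR-20 reliability for 0/1 responses (rows = examinees, columns = items).

    Uses the population variance of total scores; some texts use the
    sample variance instead. Undefined if all totals are identical.
    """
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    sum_pq = 0.0
    for j in range(k):
        p = mean(row[j] for row in responses)
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))
```

Given a response matrix `responses`, the total scores are `[sum(row) for row in responses]` and the scores for item `j` are `[row[j] for row in responses]`. Note that this sketch ranks examinees on totals that include the item being analyzed; a corrected analysis would exclude it.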

Human-authored questions outperformed AI-generated questions across all quality indicators. AI questions showed lower discrimination (mean 0.19 vs. 0.29) and a higher proportion outside the acceptable difficulty range (56% vs. 32%). Distractor analysis also favored human questions, with fewer nonfunctioning distractors and more items containing fully functional distractors. While some AI items met ideal psychometric thresholds, overall consistency was lower.
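As a rough illustration of the two summary measures reported above, the following sketch flags nonfunctioning distractors and computes the share of items outside an acceptable difficulty range. The cutoffs are assumptions: a distractor chosen by fewer than 5% of examinees and a difficulty range of 0.30 to 0.70 are common conventions, but this summary does not state the thresholds the authors used.

```python
from collections import Counter


def nonfunctioning_distractors(choices, key, options="ABCDE", threshold=0.05):
    """Return the distractors selected by fewer than `threshold` of examinees.

    `choices` holds the option each examinee selected; `key` is the correct option.
    The 5% threshold is an assumed convention, not the paper's stated cutoff.
    """
    counts = Counter(choices)
    n = len(choices)
    return [opt for opt in options if opt != key and counts[opt] / n < threshold]


def share_outside_range(difficulties, low=0.30, high=0.70):
    """Fraction of items whose difficulty index falls outside [low, high].

    The default bounds are assumed for illustration only.
    """
    outside = [p for p in difficulties if not (low <= p <= high)]
    return len(outside) / len(difficulties)
```

An item whose distractor list comes back empty would count as having fully functional distractors under these assumptions.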

Generative AI in its current form cannot yet match human expertise in producing consistently high-quality MCQs for pediatrics. However, AI shows potential as a supplementary tool, particularly within hybrid human–AI workflows that combine efficiency with expert oversight. These findings highlight both the opportunities and limitations of AI in medical education assessment and underscore the importance of balancing reliability, validity, acceptability, and cost-effectiveness when integrating AI into assessment design.

## Full-text entities

- **Diseases:** hyperbilirubinemia (MESH:D006932), hemolysis (MESH:D006461), infectious diseases (MESH:D003141)
- **Chemicals:** bilirubin (MESH:D001663)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12957612/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12957612/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/PMC12957612/full.md

---
Source: https://tomesphere.com/paper/PMC12957612