# Fine-Tuned Large Language Models for Generating Multiple-Choice Questions in Anesthesiology: Psychometric Comparison With Faculty-Written Items

**Authors:** Carlos Ramon Hölzing, Charlotte Meynhardt, Patrick Meybohm, Sarah König, Peter Kranke

PMC · DOI: 10.2196/84904 · JMIR Formative Research · 2026-02-18

## TL;DR

This study compares anesthesiology multiple-choice questions generated by a fine-tuned GPT-4-based model with questions written by senior faculty and finds no significant differences in difficulty or discrimination between the two sets.

## Contribution

The study demonstrates that fine-tuned LLMs can produce anesthesiology MCQs with psychometric properties comparable to expert-written items.

## Key findings

- LLM-generated and expert-written MCQs showed no significant differences in difficulty, point-biserial correlation, or discrimination index.
- Both expert and LLM-generated items had modest psychometric quality.
- Automated item generation should complement, not replace, manual item writing, because psychometric indices are limited and cohort-dependent.

## Abstract

Multiple-choice questions (MCQs) are widely used in medical education to ensure standardized and objective assessment. Developing high-quality items requires both subject expertise and methodological rigor. Large language models (LLMs) offer new opportunities for automated item generation. However, most evaluations rely on general-purpose prompting, and psychometric comparisons with faculty-written items remain scarce.

This study aimed to evaluate whether a fine-tuned LLM can generate MCQs (Type A) in anesthesiology with psychometric properties comparable to those written by expert faculty.

The study was embedded in the regular written anesthesiology examination of the eighth-semester medical curriculum with 157 students. The examination comprised 30 single best-answer MCQs, of which 15 were generated by senior faculty and 15 by a fine-tuned GPT-based model. A custom GPT-based (GPT-4) model was adapted with anesthesiology lecture slides, the National Competence-Based Learning Objectives Catalogue (NKLM 2.0), past examination questions, and faculty publications using supervised instruction-tuning with standardized prompt–response pairs. Item analysis followed established psychometric standards.
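The three item statistics named in this abstract (difficulty, point-biserial correlation, discrimination index) can be illustrated with a minimal sketch. It is not the authors' analysis code. The function names, the 27% upper/lower group split, and the use of the uncorrected total score (item included) are assumptions chosen for illustration, and the response matrix is hypothetical.

```python
from statistics import mean, pstdev
import math

def item_difficulty(item_scores):
    """Proportion of examinees who answered the item correctly (0 to 1)."""
    return mean(item_scores)

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item score and the total score.

    Assumption: the total score includes the item itself (uncorrected).
    Returns NaN when the correlation is undefined (no variance).
    """
    p = mean(item_scores)
    sd = pstdev(total_scores)
    if p in (0, 1) or sd == 0:
        return float("nan")
    m1 = mean(t for s, t in zip(item_scores, total_scores) if s == 1)
    m0 = mean(t for s, t in zip(item_scores, total_scores) if s == 0)
    return (m1 - m0) / sd * math.sqrt(p * (1 - p))

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Difference in proportion correct between top and bottom scorers.

    Assumption: groups are the upper and lower 27% by total score.
    """
    n = len(item_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = [item_scores[i] for i in order[:k]]
    upper = [item_scores[i] for i in order[-k:]]
    return mean(upper) - mean(lower)

# Hypothetical response matrix: rows are examinees, columns are items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
totals = [sum(row) for row in responses]
first_item = [row[0] for row in responses]

print(item_difficulty(first_item))
print(point_biserial(first_item, totals))
print(discrimination_index(first_item, totals))
```

With only 157 examinees and easy items (mean difficulty near 0.8), indices like these have little room to vary, which is one reason they depend on the cohort.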

In total, 29 items (14 expert, 15 LLM-generated) were analyzed. Expert-generated questions had a mean difficulty of 0.81 (SD 0.19), point-biserial correlation of 0.19 (SD 0.07), and discrimination index of 0.09 (SD 0.08). LLM-generated items had a mean difficulty of 0.79 (SD 0.18), point-biserial correlation of 0.17 (SD 0.04), and discrimination index of 0.08 (SD 0.11). Mann-Whitney U tests revealed no significant differences between expert- and LLM-generated items for difficulty (P=.38), point-biserial correlation coefficient (P=.96), or discrimination index (P=.59). Categorical analyses confirmed no significant group differences. Both sets, however, showed only modest psychometric quality.
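The group comparison reported above can be sketched as a two-sided Mann-Whitney U test on item-level indices. This is a standard-library illustration, not the authors' procedure. It assumes the normal approximation with tie and continuity corrections, whereas the paper may have used an exact test or a statistics package. The helper names and the sample values are hypothetical.

```python
import math

def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value).

    Assumption: normal approximation with tie and continuity corrections.
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0
    # Continuity correction; clamp at zero so p never exceeds 1.
    z = max(abs(u1 - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(variance)
    p_value = math.erfc(z / math.sqrt(2))
    return u1, p_value

# Hypothetical item difficulties for two small item sets (not the study data).
expert_items = [0.92, 0.85, 0.61, 0.97, 0.74]
llm_items = [0.88, 0.79, 0.66, 0.95, 0.70]
u_stat, p = mann_whitney_u(expert_items, llm_items)
print(u_stat, p)
```

A rank-based test suits this setting because item indices from 14 and 15 items are few, bounded, and unlikely to be normally distributed. With samples this small, a non-significant result is also compatible with a moderate undetected difference.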

LLMs adapted through supervised fine-tuning can generate MCQs with psychometric properties comparable to those written by experienced faculty. Given the limitations and cohort dependency of psychometric indices, automated item generation should be considered a complement to, rather than a replacement for, manual item writing. Further research with larger item sets and multi-institutional validation is needed to confirm generalizability and to optimize the integration of LLM-based tools into assessment development.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12916093/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12916093/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12916093/full.md

---
Source: https://tomesphere.com/paper/PMC12916093