# Evaluation of Multiple-Choice Tests in Head and Neck Ultrasound Created by Physicians and Large Language Models

**Authors:** Jacob P. S. Nielsen, August Krogh Mikkelsen, Julian Kuenzel, Merry E. Sebelik, Gitta Madani, Tsung-Lin Yang, Tobias Todsen

PMC · DOI: 10.3390/diagnostics15151848 · Diagnostics · 2025-07-22

## TL;DR

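This study compares multiple-choice questions on head and neck ultrasound written by physicians with those generated by two large language models. LLM-generated questions were comparable in quality to physician drafts but scored significantly lower than expert-validated questions, so expert validation is still needed before such items are used for assessment.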

## Contribution

The study evaluates the quality of LLM-generated MCQs for head and neck ultrasound against both physician-drafted and expert-validated questions, using ratings from an international expert panel blinded to the source of each question.

## Key findings

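- LLM-generated MCQs were comparable in quality to physician drafts but scored significantly lower than expert-validated questions (n = 16) across all criteria.
- The two LLMs did not differ significantly from each other. Compared with physician-drafted questions, Google Gemini differed significantly in relevance, suitability, and adequacy of rationale, whereas ChatGPT differed only in suitability.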
- LLMs can provide cost-effective drafts, but expert validation is necessary for high-quality assessments.

## Abstract

Background/Objectives: Otolaryngologists are increasingly using head and neck ultrasound (HNUS). Determining whether a practitioner of HNUS has achieved adequate theoretical knowledge remains a challenge. This study assesses the performance of two large language models (LLMs) in generating multiple-choice questions (MCQs) for head and neck ultrasound, compared with MCQs generated by physicians. Methods: Physicians and two LLMs, ChatGPT (GPT-4o) and Google Gemini (Gemini Advanced), created a total of 90 MCQs that covered the topics of lymph nodes, thyroid, and salivary glands. Experts in HNUS additionally evaluated all physician-drafted MCQs using a Delphi-like process. The MCQs were assessed by an international panel of experts in HNUS, who were blinded to the source of the questions. The evaluation used a Likert scale and was based on an overall assessment including six criteria: clarity, relevance, suitability, quality of distractors, adequacy of the rationale for the answer, and level of difficulty. Results: Four experts in the clinical field of HNUS assessed the 90 MCQs. No significant differences were observed between the two LLMs. Physician-drafted questions (n = 30) differed significantly from Google Gemini questions in relevance, suitability, and adequacy of the rationale for the answer, but differed significantly from ChatGPT questions only in suitability. Compared with MCQ items (n = 16) validated by medical experts, LLM-constructed MCQ items scored significantly lower across all criteria. The difficulty level of the MCQs was the same. Conclusions: Our study demonstrates that both LLMs could be used to generate MCQ items with a quality comparable to drafts from physicians. However, the quality of LLM-generated MCQ items was still significantly lower than that of MCQs validated by ultrasound experts. LLMs are therefore a cost-effective way to generate a quick draft of MCQ items, which should afterward be validated by experts before being used for assessment purposes. In this way, the value of LLMs lies not in the elimination of humans but in vastly superior time management.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12346108/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12346108/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC12346108/full.md

---
Source: https://tomesphere.com/paper/PMC12346108