# Evaluating cognitive depth of AI-generated multiple-choice questions with Bloom’s Taxonomy

**Authors:** Trang Thi Nguyen, Linh Nguyen, Ha Thi Nguyet Do, Huong Thi Thu Nguyen, Son Minh Tong, Musa Ayanwale

PMC · DOI: 10.1371/journal.pone.0341317 · PLOS One · 2026-02-27

## TL;DR

This study evaluates how well five large language models (LLMs) generate multiple-choice questions on oral and maxillofacial anatomy that match each cognitive level of Bloom’s Taxonomy.

## Contribution

The paper evaluates the ability of LLMs to generate MCQs aligned with Bloom’s Taxonomy in medical and dental education, an alignment the authors describe as previously unexplored.

## Key findings

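- Claude Sonnet 4 outperformed the other models at the applying, analyzing, and evaluating/creating levels (p = 0.01, 0.003, and 0.005, respectively).
- ChatGPT-4o, DeepSeek R1, and Grok 3 performed significantly better at the lower cognitive levels than at the higher ones.
- Inter-rater reliability was moderate to strong across all models (weighted Cohen’s kappa = 0.74–0.86).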

## Abstract

While LLMs are used to generate medical and dental MCQs, their alignment with Bloom’s Taxonomy remains unexplored.

Five widely used LLMs were evaluated: ChatGPT-4o (OpenAI), Copilot Pro (Microsoft), Claude Sonnet 4 (Anthropic), Grok 3 (xAI), and DeepSeek R1 (DeepSeek). Each model generated 60 MCQs (300 in total) based on content from an oral and maxillofacial anatomy textbook, covering the five cognitive levels of Bloom’s Taxonomy. Two independent investigators rated each item on a 5-point Likert scale for alignment with its target level: remembering, understanding, applying, analyzing, or evaluating/creating. Inter-rater reliability was measured using weighted Cohen’s kappa. Model performance and inter-model differences were analyzed using the Kruskal–Wallis test.
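To make the reliability measure concrete, the following is a minimal sketch of weighted Cohen’s kappa for two raters on a 5-point scale. It is not the authors’ analysis code. The function name `weighted_kappa`, the `power` parameter, and the sample ratings are illustrative assumptions, and the abstract does not say whether linear or quadratic weights were used, so both are offered.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5), power=1):
    """Weighted Cohen's kappa for two raters on an ordinal scale.

    power=1 gives linear disagreement weights, power=2 quadratic.
    Which scheme the paper used is not stated, so this is an assumption.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Joint counts of (rater A category, rater B category) and the marginals.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            observed += weight * joint[(i, j)] / n
            expected += weight * (marg_a[i] / n) * (marg_b[j] / n)

    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


# Hypothetical ratings for illustration only (not data from the paper).
scores_rater_1 = [5, 4, 4, 3, 5, 2, 4, 5]
scores_rater_2 = [5, 4, 3, 3, 5, 3, 4, 4]
kappa = weighted_kappa(scores_rater_1, scores_rater_2)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. Weighting penalizes a 5-versus-1 disagreement more heavily than a 5-versus-4 one, which suits ordinal Likert ratings.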

Inter-rater reliability was moderate to strong (kappa = 0.74–0.86). Median scores for remembering, understanding, applying, and evaluating/creating were above 4 across all LLMs, while the analyzing level scored a median of 3.5 for ChatGPT-4o and DeepSeek R1. No significant difference was found between models in remembering and understanding levels (p > 0.05). Claude Sonnet 4 outperformed the other models at the applying, analyzing, and evaluating/creating levels (p = 0.01, 0.003, and 0.005, respectively). Within-model analysis showed that only Copilot Pro and Claude Sonnet 4 consistently aligned with Bloom’s cognitive levels across all categories. In contrast, ChatGPT-4o, DeepSeek R1, and Grok 3 performed significantly better at the lower cognitive levels (p = 0.00, 0.00, and 0.001, respectively).
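For readers unfamiliar with the test behind the between-model p-values, here is a minimal sketch of the Kruskal–Wallis H statistic with tie correction, written in plain Python. It is an illustration under stated assumptions, not the authors’ code: the function names `kruskal_wallis` and `chi2_sf_even_df` and the sample scores are hypothetical, and the p-value helper uses a closed form that holds only for an even number of degrees of freedom (comparing five models gives df = 4).

```python
import math


def kruskal_wallis(groups):
    """Return (H, df) for a list of score lists, corrected for ties."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)

    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                      # size of this tie group
        rank_of[pooled[i]] = (i + 1 + j) / 2
        tie_term += t ** 3 - t
        i = j

    rank_sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)

    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("H is undefined when all values are identical")
    return h / correction, len(groups) - 1


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Hypothetical Likert scores for three models (not data from the paper).
h_stat, dof = kruskal_wallis([[4, 5, 5, 4], [3, 4, 3, 4], [5, 5, 4, 5]])
p_value = chi2_sf_even_df(h_stat, dof)
```

Because Likert scores take only five values, ties are common, so the tie correction matters here. The chi-square approximation for the p-value is the usual large-sample one and can be rough for very small groups.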

All LLMs performed well at lower cognitive levels, while Claude Sonnet 4 achieved the highest alignment at higher-order levels.

## Full-text entities

- **Diseases:** colorectal cancer (MESH:D015179)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12948114/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12948114/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/PMC12948114/full.md

---
Source: https://tomesphere.com/paper/PMC12948114