# Comparing generative artificial intelligence platforms and nursing student performance on a women’s health nursing examination in Korea: a Rasch model approach

**Authors:** Eun Jeong Ko, Tae Kyung Lee, Geum Hee Jeong

PMC · DOI: 10.3352/jeehp.2025.22.23 · Journal of Educational Evaluation for Health Professions · 2025-09-05

## TL;DR

This study used the Rasch model to compare the ability estimates of 6 generative AI platforms with those of 111 fourth-year nursing students on a 50-item women's health nursing examination in Korea.

## Contribution

The study uses Rasch ability parameter estimates to evaluate generative AI platforms against nursing students on the same examination items, and it supports the model's potential utility for AI performance assessment in nursing education.

## Key findings

- GPT-4o, ChatGPT free version, and Claude.ai scored above the median student ability of 1.09 logits (2.68, 2.34, and 2.34 logits, respectively).
- Clova X, Mistral.ai, and Google Gemini scored below the student median (0.20, –0.12, and 0.80 logits).
- The Rasch model effectively assessed AI competency for nursing exams.

## Abstract

This psychometric study aimed to compare the ability parameter estimates of generative artificial intelligence (AI) platforms with those of nursing students on a 50-item women’s health nursing examination at Hallym University, Korea, using the Rasch model. It also sought to estimate item difficulty parameters and evaluate AI performance across varying difficulty levels.

The exam, consisting of 39 multiple-choice items and 11 true/false items, was administered to 111 fourth-year nursing students in June 2023. In December 2024, 6 generative AI platforms (GPT-4o, ChatGPT free version, Claude.ai, Clova X, Mistral.ai, Google Gemini) completed the same items. The responses were analyzed using the Rasch model to estimate the ability and difficulty parameters. Unidimensionality was verified by the Dimensionality Evaluation to Enumerate Contributing Traits (DETECT), and analyses were conducted using the R packages irtQ and TAM.
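For reference, the standard Rasch (one-parameter logistic) model behind such ability and difficulty estimates can be written as below. The notation is conventional and assumed here, since the abstract does not state the formula: $\theta_i$ is the ability of examinee $i$, $b_j$ is the difficulty of item $j$, and $X_{ij}$ is the scored response (1 = correct).

$$
P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
$$

Under the same model, the test information at ability $\theta$ is the sum of the item information values, where $P_j(\theta)$ is the probability above for item $j$:

$$
I(\theta) = \sum_{j} P_j(\theta)\,\bigl(1 - P_j(\theta)\bigr)
$$

Ability and difficulty share one logit scale, so an examinee whose ability equals an item's difficulty has a probability of 0.5 of answering that item correctly.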

The items satisfied unidimensionality (DETECT=–0.16). Item difficulty parameter estimates ranged from –3.87 to 1.96 logits (mean=–0.61), with a mean difficulty index of 0.79. Examinees’ ability parameter estimates ranged from –0.71 to 3.15 logits (mean=1.17). GPT-4o, ChatGPT free version, and Claude.ai outperformed the median student ability (1.09 logits), scoring 2.68, 2.34, and 2.34, respectively, while Clova X, Mistral.ai, and Google Gemini exhibited lower scores (0.20, –0.12, 0.80). The test information curve peaked below θ=0, indicating suitability for examinees with low to average ability.
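To make the logit values above more concrete, the following Python sketch converts the reported estimates into Rasch response probabilities for a hypothetical item of mean difficulty (–0.61 logits) and lists the platforms above the student median. It is illustrative only and is not the authors' analysis, which used the R packages irtQ and TAM. The function names are hypothetical, and the mapping of 0.20, –0.12, and 0.80 to Clova X, Mistral.ai, and Google Gemini assumes the order in which the abstract lists them.

```python
import math

# Ability estimates in logits, taken from the abstract.
# Assumption: the last three values follow the order in which the
# abstract lists Clova X, Mistral.ai, and Google Gemini.
PLATFORM_ABILITY = {
    "GPT-4o": 2.68,
    "ChatGPT free version": 2.34,
    "Claude.ai": 2.34,
    "Clova X": 0.20,
    "Mistral.ai": -0.12,
    "Google Gemini": 0.80,
}
MEDIAN_STUDENT_ABILITY = 1.09  # logits, from the abstract
MEAN_ITEM_DIFFICULTY = -0.61   # logits, from the abstract


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def above_median(abilities: dict[str, float], median: float) -> list[str]:
    """Names of platforms whose ability estimate exceeds the given median."""
    return sorted(name for name, theta in abilities.items() if theta > median)


if __name__ == "__main__":
    # Expected success rate of each platform on an item of mean difficulty.
    for name, theta in PLATFORM_ABILITY.items():
        p = rasch_probability(theta, MEAN_ITEM_DIFFICULTY)
        print(f"{name}: theta={theta:+.2f}, P(correct)={p:.2f}")

    # Reference point: a student at the median ability.
    p_median = rasch_probability(MEDIAN_STUDENT_ABILITY, MEAN_ITEM_DIFFICULTY)
    print(f"Median student: P(correct)={p_median:.2f}")
    print("Above median:", above_median(PLATFORM_ABILITY, MEDIAN_STUDENT_ABILITY))
```

Because the mean item difficulty is well below the mean examinee ability (1.17 logits), most examinees and platforms have a high expected success rate on an average item, which is consistent with the reported mean difficulty index of 0.79.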

Advanced generative AI platforms approximated the performance of high-performing students, but outcomes varied. The Rasch model effectively evaluated AI competency, supporting its potential utility for future AI performance assessments in nursing education.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12770907/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12770907/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/PMC12770907/full.md

---
Source: https://tomesphere.com/paper/PMC12770907