# Performance of Large Language Models in the Japanese Public Health Nurse National Examination: Comparative Cross-Sectional Study

**Authors:** Yutaro Takahashi, Ryota Kumakura, Rie Okamoto, Shizuko Omote

PMC · DOI: 10.2196/82842 · JMIR Nursing · 2026-02-20

## TL;DR

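This study compares three large language models (GPT-4o, Claude Opus 4, and Gemini 2.5 Pro) on the 111th Japanese Public Health Nurse National Examination. All three passed, with accuracy between 85.5% and 92.7%, but each was less accurate on multiple-choice questions than on single-choice questions.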

## Contribution

First evaluation of LLM performance on the Japanese Public Health Nurse National Examination.

## Key findings

- All three LLMs exceeded the 60% passing criterion, with accuracy rates from 85.5% to 92.7%.
- All models were less accurate on multiple-choice questions than on single-choice questions; the difference was statistically significant for GPT-4o and Claude Opus 4.
- No significant differences in overall accuracy were found among the three LLMs.

## Abstract

Large language models (LLMs) have shown promising results on Japanese national medical and nursing examinations. However, no study has evaluated LLM performance on the Japanese Public Health Nurse National Examination, which requires specialized knowledge in community health and public health nursing practice.

This study aimed to compare the performance of multiple LLMs on the Japanese Public Health Nurse National Examination and evaluate their potential utility in public health nursing education.

Three LLMs were evaluated: GPT-4o, Claude Opus 4, and Gemini 2.5 Pro. All 110 questions from the 111th Public Health Nurse National Examination were administered using standardized prompts. Questions were classified by format (text vs figure or calculation), content (general vs situational), and selection type (single vs multiple choice). Accuracy rates and 95% CIs were calculated, with statistical comparisons performed using chi-square tests.
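To illustrate how an accuracy rate and its 95% CI can be computed from a count of correct answers, here is a minimal sketch. The abstract does not state which interval method was used; this example assumes the exact (Clopper-Pearson) binomial interval, so its output may differ slightly from the published values. The function names `binom_cdf` and `clopper_pearson` are hypothetical helpers written for this example, and the count 94/110 is the GPT-4o result reported in the abstract.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, iters: int = 100) -> float:
    """Bisection on [0, 1] for the root of a function that increases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) CI for a binomial proportion k/n.

    Assumption: the paper may have used a different interval method.
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


# GPT-4o: 94 of 110 questions correct, as reported in the abstract.
correct, total = 94, 110
low, high = clopper_pearson(correct, total)
print(f"accuracy = {correct / total:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

The bisection approach avoids any dependency beyond the standard library; a statistics package with a built-in exact binomial interval would be the more usual choice in practice.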

All LLMs exceeded the passing criterion (60%). The accuracy rates were as follows: 85.5% (94/110) for GPT-4o (95% CI 77.5%‐91.5%), 91.8% (101/110) for Claude Opus 4 (95% CI 85.0%‐96.2%), and 92.7% (102/110) for Gemini 2.5 Pro (95% CI 86.2%‐96.8%). No significant differences were found among the LLMs (P>.99). However, all models showed lower accuracy on multiple-choice questions than on single-choice questions, with significant intramodel differences observed for GPT-4o (10/16, 62.5% vs 82/92, 89.1%; P=.01) and Claude Opus 4 (12/16, 75% vs 87/92, 94.6%; P=.03).
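The intramodel comparison above can be framed as a 2×2 table (selection type by correct or incorrect). The following is a minimal sketch of a Pearson chi-square test on such a table, using the GPT-4o subgroup counts as reported (10/16 multiple choice, 82/92 single choice). It assumes no continuity correction; the abstract does not say whether a correction or an exact test was applied, so the P value this produces should not be expected to match the published one. `chi_square_2x2` is a hypothetical helper written for this example.

```python
from math import erfc, sqrt


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    Assumption: no continuity correction is applied.
    """
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    n = a + b + c + d

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Rows: multiple choice, single choice. Columns: correct, incorrect.
# GPT-4o counts as reported: 10/16 and 82/92.
stat, p = chi_square_2x2(10, 16 - 10, 82, 92 - 82)
print(f"chi-square = {stat:.2f}, P = {p:.3f}")
```

With only 16 multiple-choice questions, one expected cell count falls below 5, which is the usual situation in which a continuity correction or Fisher exact test is preferred over the uncorrected statistic.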

LLMs demonstrated high performance on a public health nursing examination but showed limitations in complex reasoning requiring multiple-choice selection. These findings suggest the potential for LLM use as educational support tools while highlighting the need for cautious implementation in specialized nursing education.

## Full-text entities

- **Diseases:** infectious disease (MESH:D003141), emergency (MESH:D004630)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12923089/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/PMC12923089/full.md

---
Source: https://tomesphere.com/paper/PMC12923089