# Diagnostic Accuracy and Stability of Multimodal Large Language Models for Hand Fracture Detection: A Multi-Run Evaluation on Plain Radiographs

**Authors:** Ibrahim Güler, Gerrit Grieb, Armin Kraus, Martin Lautenbach, Henrik Stelling

PMC · DOI: 10.3390/diagnostics16030424 · Diagnostics · 2026-02-01

## TL;DR

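This study evaluates four multimodal large language models on hand fracture detection in plain radiographs, running each image five times per model. The best model (GPT-5 Pro) reached 64.3% accuracy, and high run-to-run agreement did not guarantee correct answers, so none is reliable as a standalone diagnostic tool.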

## Contribution

The study introduces a multi-run evaluation framework (five repeated zero-shot inferences per image) that measures diagnostic accuracy and run-to-run stability as separate dimensions of MLLM performance in hand fracture detection.

## Key findings

- GPT-5 Pro achieved the highest diagnostic accuracy (64.3%) and consistency (κ = 0.71) among the evaluated models.
- Mistral Medium 3.1 showed high agreement (κ = 0.88) but low accuracy (38.5%), indicating systematic errors.
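- Scaphoid fractures were challenging for all models, whereas the top models reliably detected phalangeal fractures.
- Demographic inference was poor: age estimation errors exceeded 12 years and sex prediction accuracy was near random chance.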
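
The distinction between accuracy and consistency in these findings can be illustrated with a small sketch. It assumes each inference is scored as simply correct or incorrect against the confirmed diagnosis, and it groups cases into a three-way agreement profile; both the scoring rule and the function names (`agreement_profile`, `per_inference_accuracy`) are hypothetical and not taken from the paper. The per-run outcomes are invented toy data.

```python
# Illustrative only: the per-run outcomes below are invented, not from the paper.

def agreement_profile(run_correct):
    """Count cases where all runs were correct, all were wrong, or runs disagreed.

    run_correct: list of per-case lists of booleans (True = run matched the reference).
    """
    profile = {"all_correct": 0, "all_wrong": 0, "mixed": 0}
    for runs in run_correct:
        if all(runs):
            profile["all_correct"] += 1
        elif not any(runs):
            profile["all_wrong"] += 1
        else:
            profile["mixed"] += 1
    return profile


def per_inference_accuracy(run_correct):
    """Fraction of all individual inferences that were correct."""
    total = sum(len(runs) for runs in run_correct)
    correct = sum(sum(runs) for runs in run_correct)
    return correct / total


# Toy model A: perfectly repeatable, but wrong on 6 of 10 cases (systematic error).
stable_but_wrong = [[True] * 5] * 4 + [[False] * 5] * 6
# Toy model B: same overall accuracy, but the answer flips between runs on every case.
unstable = [[True, False, True, False, False]] * 10

for name, data in [("stable_but_wrong", stable_but_wrong), ("unstable", unstable)]:
    print(name, per_inference_accuracy(data), agreement_profile(data))
```

Both toy models have the same per-inference accuracy, yet the first never changes its answer and the second never repeats it consistently, which is why the two dimensions need to be reported separately.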

## Abstract

Background/Objectives: Multimodal large language models (MLLMs) offer potential for automated fracture detection, yet their diagnostic stability under repeated inference remains underexplored. This study evaluates the diagnostic accuracy, stability, and intra-model consistency of four MLLMs in detecting hand fractures on plain radiographs.

Methods: In total, images of hand radiographs of 65 adult patients with confirmed hand fractures (30 phalangeal, 30 metacarpal, 5 scaphoid) were evaluated by four models: GPT-5 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, and Mistral Medium 3.1. Each image was independently analyzed five times per model using identical zero-shot prompts (1300 total inferences). Diagnostic accuracy, inter-run reliability (Fleiss’ κ), case-level agreement profiles, subgroup performance, and exploratory demographic inference (age, sex) were assessed.

Results: GPT-5 Pro achieved the highest accuracy (64.3%) and consistency (κ = 0.71), followed by Gemini 2.5 Pro (56.9%, κ = 0.57). Mistral Medium 3.1 exhibited high agreement (κ = 0.88) despite low accuracy (38.5%), indicating systematic error (“confident hallucination”). Claude Sonnet 4.5 showed low accuracy (33.8%) and consistency (κ = 0.33), reflecting instability. While phalangeal fractures were reliably detected by top models, scaphoid fractures remained challenging. Demographic analysis revealed poor capabilities, with age estimation errors exceeding 12 years and sex prediction accuracy near random chance.

Conclusions: Diagnostic accuracy and consistency are distinct performance dimensions; high intra-model agreement does not imply correctness. While GPT-5 Pro demonstrated the most favorable balance of accuracy and stability, other models exhibited critical failure modes ranging from systematic bias to random instability. At present, MLLMs should be regarded as experimental diagnostic reasoning systems rather than reliable standalone tools for clinical fracture detection.
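
For readers unfamiliar with the reliability statistic named in the abstract, the following is a minimal sketch of the standard Fleiss' κ computation, treating the five repeated runs of one model as the "raters". The label set (`"fracture"`, `"no_fracture"`) and the toy ratings are assumptions for illustration; the summary does not state which response categories the authors fed into κ. The first line also checks the inference count given in the abstract.

```python
from collections import Counter

# Sanity check on the study design: 65 images x 5 runs x 4 models.
assert 65 * 5 * 4 == 1300


def fleiss_kappa(ratings, categories):
    """Standard Fleiss' kappa.

    ratings: list of per-case lists; each inner list holds one label per run,
             and every case must have the same number of runs.
    categories: list of all possible labels.
    """
    n_cases = len(ratings)
    n_runs = len(ratings[0])
    # counts[i][j] = number of runs that assigned case i to category j
    counts = []
    for case in ratings:
        tally = Counter(case)
        counts.append([tally[c] for c in categories])
    # Observed agreement per case, then averaged over cases
    p_i = [(sum(k * k for k in row) - n_runs) / (n_runs * (n_runs - 1)) for row in counts]
    p_bar = sum(p_i) / n_cases
    # Agreement expected by chance from the overall category proportions
    p_j = [sum(row[j] for row in counts) / (n_cases * n_runs) for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical labels and toy data, not from the paper.
labels = ["fracture", "no_fracture"]
perfectly_repeatable = [["fracture"] * 5, ["no_fracture"] * 5,
                        ["fracture"] * 5, ["no_fracture"] * 5]
print(fleiss_kappa(perfectly_repeatable, labels))
```

Note that κ is computed from the runs alone and never consults the reference diagnosis, which is why a model can score high agreement while being systematically wrong.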

## Full-text entities

- **Diseases:** Hand Fracture (MESH:D006230), fracture (MESH:D050723), hallucination (MESH:D006212), phalangeal fractures (MESH:C537571)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12897326/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12897326/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC12897326/full.md

---
Source: https://tomesphere.com/paper/PMC12897326