# Comparative study of advanced reasoning versus baseline large-language models for histopathological diagnosis in oral and maxillofacial pathology

**Authors:** Viet Anh Nguyen, Van Hung Nguyen, Thi Quynh Trang Vuong, Quoc Thanh Truong, Thi Trang Nguyen

PMC · DOI: 10.1371/journal.pone.0340220 · PLOS One · 2025-12-31

## TL;DR

OpenAI's reasoning-augmented o3 model outperforms the baseline GPT-4o in diagnosing oral and maxillofacial pathology cases (31.6% versus 18.7% accuracy), but it still needs improvements in response speed and reproducibility.

## Contribution

Demonstrates that a reasoning-augmented LLM (o3) achieves higher diagnostic accuracy and more detailed microscopic descriptions than a baseline model (GPT-4o) in oral and maxillofacial histopathology.

## Key findings

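- The o3 model achieved 31.6% accuracy versus 18.7% for GPT-4o (a difference of 12.9 percentage points, P < 0.001) on 459 oral and maxillofacial cases.
- o3 provided more detailed descriptions for correct diagnoses (median Likert score 9 versus 8) but had slower response times (mean 98 s versus near-instant) and lower reproducibility across repeated queries (40.2% versus 57.6%).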
- A board-certified general pathologist achieved 28.3% accuracy, highlighting the diagnostic challenge.

## Abstract

Large language models (LLMs) are increasingly explored as diagnostic copilots in digital pathology, but whether the newest reasoning-augmented architectures provide measurable benefits over earlier versions is unknown. We compared OpenAI’s o3 model, which uses an iterative planning loop, with the baseline GPT-4o on 459 oral and maxillofacial (OMF) cases drawn from standard textbooks. Each case consisted of two to five high-resolution haematoxylin-and-eosin micrographs, and both models were queried in zero-shot mode with an identical prompt requesting a single diagnosis and supporting microscopic features. Overall, o3 correctly classified 31.6% of cases, significantly surpassing GPT-4o at 18.7% (Δ = 12.9%, P < 0.001). The largest gain was recorded for the heterogeneous “other conditions” category (37.2% versus 20.2%). For correctly diagnosed cases, o3 generated more detailed descriptions (median Likert score 9 versus 8, P = 0.003). These benefits were offset by longer mean response time (98 s versus near-instant) and lower reproducibility across repeated queries (40.2% versus 57.6%). A board-certified general pathologist achieved 28.3% accuracy on the same image set, underscoring the difficulty of the task. Ground truth was established by two board-certified OMF pathologists with high inter-rater reliability, ensuring the reliability of the reference standard. The general pathologist served only as a non-OMF difficulty benchmark. The findings indicate that advanced reasoning mechanisms materially improve diagnostic performance and explanatory depth in complex histopathology, but additional optimisation is required to meet clinical speed and consistency thresholds. Clinically, such models are adjunctive ‘copilots’ for preliminary descriptions and differential diagnoses; expert OMF pathologists retain full responsibility for sign-out.
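The headline accuracy comparison can be sanity-checked with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that are not from the paper: the case counts are reconstructed from the rounded percentages in the abstract, and the significance check uses an unpaired two-proportion z-test as a rough stand-in. Both models saw the same 459 cases, so a paired test such as McNemar's would be the more appropriate choice, but it needs per-case results that this summary does not provide. The function names are hypothetical.

```python
import math

N_CASES = 459  # number of OMF cases, from the abstract


def approx_count(pct, n=N_CASES):
    """Reconstruct an approximate number of correct cases from a rounded percentage."""
    return round(pct / 100 * n)


def two_proportion_z(k1, k2, n):
    """Unpaired two-proportion z-test with equal group sizes (a rough stand-in;
    this ignores the pairing of cases and is not the authors' stated method)."""
    p1, p2 = k1 / n, k2 / n
    pooled = (k1 + k2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Percentages as reported in the abstract.
o3_correct = approx_count(31.6)
gpt4o_correct = approx_count(18.7)

# Difference in percentage points (the abstract's delta).
delta_pp = round(31.6 - 18.7, 1)

z, p_value = two_proportion_z(o3_correct, gpt4o_correct, N_CASES)
print(f"o3: {o3_correct}/{N_CASES}, GPT-4o: {gpt4o_correct}/{N_CASES}")
print(f"difference: {delta_pp} percentage points, z = {z:.2f}, p = {p_value:.2g}")
```

Under these assumptions the rough check agrees in direction with the reported P < 0.001, but it should not be read as a reproduction of the paper's analysis.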

## Full-text entities

- **Chemicals:** haematoxylin (MESH:D006416), eosin (MESH:D004801)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12755752/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12755752/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12755752/full.md

---
Source: https://tomesphere.com/paper/PMC12755752