# Large language model bias auditing for periodontal diagnosis using an ambiguity-probe methodology: a pilot study

**Authors:** Teerachate Nantakeeratipat

PMC · DOI: 10.3389/fdgth.2025.1687820 · Frontiers in Digital Health · 2026-01-05

## TL;DR

This pilot study examines how two large language models, GPT-4o and Gemini 2.5 Pro, handle periodontal diagnosis under clinical ambiguity. It finds no statistically significant sociodemographic bias in either model and attributes the observed errors to diagnostic boundary instability.

## Contribution

This is among the first studies to use simulated clinical ambiguity to assess LLM fairness in dentistry, distinguishing between diagnostic errors and bias.

## Key findings

- GPT-4o showed higher combined stage-and-grade accuracy (70.5%) than Gemini 2.5 Pro (33.3%) in clear-cut periodontal scenarios.
- Neither model exhibited statistically significant sociodemographic bias in either clear-cut or ambiguous scenarios.
- Errors were attributed to diagnostic boundary instability rather than bias.

## Abstract

Large Language Models (LLMs) in healthcare hold immense promise yet carry the risk of perpetuating social biases. While artificial intelligence (AI) fairness is a growing concern, a gap exists in understanding how these models perform under conditions of clinical ambiguity, a common feature of real-world practice.

We conducted a study using an ambiguity-probe methodology with a set of 42 sociodemographic personas and 15 clinical vignettes based on the 2018 classification of periodontal diseases. Ten were clear-cut scenarios with established ground truths, while five were intentionally ambiguous. OpenAI's GPT-4o and Google's Gemini 2.5 Pro were prompted to provide periodontal stage and grade assessments using 630 vignette-persona combinations per model.
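The 630 combinations per model follow from crossing every persona with every vignette (42 × 15). A minimal sketch of that fully crossed design is shown below. The persona and vignette identifiers are hypothetical placeholders, and the ordering of clear-cut before ambiguous vignettes is an assumption made only for illustration; the actual persona attributes and vignette texts are in the full paper.

```python
from itertools import product

# Counts taken from the abstract; identifiers below are illustrative placeholders.
N_PERSONAS = 42
N_CLEAR_CUT = 10
N_AMBIGUOUS = 5

personas = [f"persona_{i:02d}" for i in range(1, N_PERSONAS + 1)]

# Assumption: the first 10 vignettes are labelled clear-cut, the last 5 ambiguous.
vignettes = [
    (f"vignette_{i:02d}", "clear-cut" if i <= N_CLEAR_CUT else "ambiguous")
    for i in range(1, N_CLEAR_CUT + N_AMBIGUOUS + 1)
]

# Fully crossed design: every persona is paired with every vignette.
design = [
    {"persona": p, "vignette": v, "scenario_type": t}
    for p, (v, t) in product(personas, vignettes)
]

n_clear = sum(row["scenario_type"] == "clear-cut" for row in design)
n_ambiguous = sum(row["scenario_type"] == "ambiguous" for row in design)
print(len(design), n_clear, n_ambiguous)
```

Under these assumptions, each model receives 420 clear-cut and 210 ambiguous prompts, so accuracy can only be scored on the 420 prompts that have an established ground truth.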

In clear-cut scenarios, GPT-4o demonstrated significantly higher combined (stage and grade) accuracy (70.5%) than Gemini 2.5 Pro (33.3%). However, a fairness analysis using cumulative link models with false discovery rate correction revealed no statistically significant sociodemographic bias in either model. This finding held across both clear-cut and ambiguous clinical scenarios.
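To illustrate what a false discovery rate correction does to a family of per-attribute p-values, the sketch below implements the Benjamini-Hochberg step-up adjustment in plain Python. This summary does not state which FDR procedure the authors used, so Benjamini-Hochberg is an assumption here, and the p-values are made-up illustrative numbers, not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, one per sociodemographic term in a fitted model.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)

ALPHA = 0.05
flagged = [p <= ALPHA for p in adj_p]
print(adj_p, flagged)
```

In this toy example, three raw p-values fall below 0.05, but only one adjusted value does, which shows why a result can look significant term by term yet not survive correction across the whole family of tests.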

To our knowledge, this is among the first studies to use simulated clinical ambiguity to reveal the distinct ethical fingerprints of LLMs in a dental context. While LLM performance gaps exist, our analysis decouples accuracy from fairness, demonstrating that both models maintain sociodemographic neutrality. We find that the observed errors reflect diagnostic boundary instability rather than bias. This highlights a critical need for future research to differentiate between these two distinct types of model failure to build genuinely reliable AI.

## Full-text entities

- **Diseases:** periodontal diseases (MESH:D010510)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12812596/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12812596/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/PMC12812596/full.md

---
Source: https://tomesphere.com/paper/PMC12812596