# Evaluating Large Language Models for Diagnostic Accuracy and Health Information Quality in Oral Mucosal Diseases

**Authors:** Melisa Iacob, Ayham Qawas, Ramesh Balasubramaniam, Agnieszka M. Frydrych, Omar Kujan

PMC · DOI: 10.3390/jpm16030129 · 2026-02-27

## TL;DR

This study compares multimodal large language models (MLLMs) with traditional search engines on diagnosing oral mucosal diseases. ChatGPT 4.5 was the most accurate (88.5%), but no platform met the recommended sixth-grade reading level.

## Contribution

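The study evaluates six multimodal large language models (MLLMs) on 100 validated oral mucosal case scenarios, with Google, Bing, and Yahoo as comparators, measuring diagnostic accuracy, information quality (DISCERN), and readability (FRES, FKGL).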

## Key findings

- ChatGPT 4.5 showed the highest diagnostic accuracy (88.5%), PPV (92%), and NPV (88%) among MLLMs.
- Traditional search engines were far less accurate (18–55%) than MLLMs.
- MLLMs provided higher-quality information but were less readable than search engine results.

## Abstract

Background: Multimodal large language model (MLLM)-based systems capable of generating health-related information and diagnostic suggestions are increasingly used for health information retrieval; however, their accuracy, readability, and quality in oral healthcare remain unclear. Oral mucosal diseases comprise a heterogeneous group of conditions affecting the oral lining, ranging from benign and reactive lesions to potentially malignant and malignant disorders.

Objective: This study evaluated and compared the diagnostic performance, readability, and information quality of MLLMs, with traditional search engines included as comparator platforms, in diagnosing oral mucosal diseases.

Methods: A cross-sectional observational study was conducted using 100 validated oral mucosal case scenarios representing benign, malignant, potentially malignant, infectious, and reactive oral lesions. Each scenario was entered into ChatGPT 3.5, ChatGPT 4.5 (Plus), Microsoft Copilot (smart), Grok (xAI), Claude (Sonnet 4.5), DeepSeek v3.1, and search engines Google, Bing, and Yahoo. Diagnostic accuracy, Positive Predictive Value (PPV), and Negative Predictive Value (NPV) were compared against reference diagnoses. Information quality was assessed using the DISCERN tool, and readability was evaluated using Flesch–Kincaid Reading Ease (FRES) and Grade Level (FKGL) scores. Statistical analyses included Cochran’s Q and McNemar tests (p < 0.05).

Results: ChatGPT 4.5 demonstrated the highest overall diagnostic accuracy (88.5%), PPV (92%), and NPV (88%), followed by DeepSeek v3.1 and Claude (Sonnet 4.5). Traditional search engines performed poorly (accuracy 18–55%). MLLMs achieved higher DISCERN scores (2.84–3.20) but lower readability (FKGL = 11–14) than search engines (FKGL = 6–7). No platform met the recommended sixth-grade reading level for consumer health information.

Conclusions: MLLMs, particularly ChatGPT Plus (GPT-4.5), outperformed conventional search engines in diagnostic accuracy and content quality but produced complex, less-readable text. Future AI development should prioritise improving clinical accuracy alongside readability and transparency to ensure equitable access to reliable oral health information.
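The measures named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the counts in the usage note are invented rather than taken from the study, and the summary does not state how positives and negatives were defined for the multi-class diagnoses or which variant of the McNemar test was used (an exact binomial version is assumed here). The readability functions use the standard published Flesch formulas and expect word, sentence, and syllable counts computed elsewhere.

```python
from math import comb


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, PPV and NPV from 2x2 confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),  # share of positive calls that are correct
        "npv": tn / (tn + fn),  # share of negative calls that are correct
    }


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (FRES): higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (FKGL): approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes.

    b: cases platform A got right and platform B got wrong.
    c: cases platform A got wrong and platform B got right.
    Assumption: exact binomial variant; the summary does not say which
    variant the study used.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With invented counts, `diagnostic_metrics(tp=40, fp=10, tn=45, fn=5)` gives an accuracy of 0.85, a PPV of 0.80, and an NPV of 0.90. A 100-word passage with 5 sentences and 150 syllables scores an FKGL of about 9.9, above the sixth-grade target mentioned in the results.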

## Full-text entities

- **Diseases:** hallucinations (MESH:D006212), Oral infections (MESH:D007239), Oral diseases (MESH:D009059), periodontal disease (MESH:D010510), OPMD (MESH:D039141), oral mucosal condition (MESH:D013280), orofacial pain (MESH:D005157), Malignant lesions (MESH:D009369), temporomandibular disorders (MESH:D013705), lip and oral cavity cancers (MESH:D008048), infectious lesions (MESH:D003141), oral cancer (MESH:D009062), anxiety (MESH:D001007), salivary gland dysfunction (MESH:D012466)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13027880/full.md

---
Source: https://tomesphere.com/paper/PMC13027880