# Accuracy and reliability of Manus, ChatGPT, and Claude in case-based dental diagnosis

**Authors:** Ahmed A. Madfa, Abdullah F. Alshammari, Bassam A. Anazi, Yousef E. Alenezi, Khlood A. Alkurdi

PMC · DOI: 10.3389/froh.2025.1686090 · 2026-01-08

## TL;DR

This study compares the diagnostic accuracy and consistency of three AI models (ChatGPT, Claude, and Manus) on 117 case-based dental multiple-choice questions.

## Contribution

The study evaluates the performance of emerging AI platforms like Manus in dental diagnosis, an area previously underexplored.

## Key findings

- In the second testing round, Claude and Manus each reached 92.3% diagnostic accuracy, compared with 76.9% for ChatGPT.
- Claude and Manus also showed greater intra-model consistency across the two time points (Cohen's kappa = 0.714 and 0.782, respectively) than ChatGPT (kappa = 0.560).
- Despite numerical advantages, differences between models were not statistically significant.

## Abstract

Artificial intelligence (AI), particularly large language models (LLMs), is transforming healthcare education and clinical decision-making. While models like ChatGPT and Claude have demonstrated utility in medical contexts, their performance in dental diagnostics remains underexplored; additionally, the potential of emerging platforms, like Manus, is yet to be evaluated.

This study aimed to compare the diagnostic accuracy and consistency of ChatGPT, Claude, and Manus using authentic, case-based dental scenarios.

A set of 117 multiple-choice questions based on validated clinical dental vignettes spanning various specialities was administered to each model under standardised conditions at two separate time points. Responses were scored against expert-validated answer keys. Inter-rater reliability was assessed using Cohen's kappa, and statistical comparisons were made using the chi-square, McNemar, and t-tests.
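As an illustration of how a paired comparison such as the McNemar test might be applied to this kind of data, the sketch below computes an exact two-sided McNemar p-value for two models scored on the same questions. The function name `mcnemar_exact` and the correct/incorrect vectors are hypothetical and are not the study's data. The abstract does not say which pairs of results the McNemar test was applied to, so the model-versus-model pairing here is an assumption.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test for paired binary outcomes.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    question, True when the corresponding model answered correctly.
    Returns (only_a, only_b, p_value), where only_a counts questions that
    only model A got right and only_b those that only model B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same questions")

    # Only discordant pairs carry information about a difference.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0

    # Under the null hypothesis, discordant pairs split 50/50 (binomial).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Hypothetical scores for 12 questions (illustrative only, not study data).
model_a = [True] * 8 + [True, True, True, False]
model_b = [True] * 8 + [False, False, False, True]

only_a, only_b, p_value = mcnemar_exact(model_a, model_b)
print(only_a, only_b, p_value)
```

With few discordant pairs the exact test has little power, which is one reason a numerical gap in accuracy can fail to reach statistical significance.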

Claude and Manus consistently outperformed ChatGPT across both testing phases. In the second round, Claude and Manus achieved a diagnostic accuracy of 92.3%, compared to ChatGPT's 76.9%. Claude and Manus also demonstrated higher intra-model consistency (Cohen's kappa = 0.714 and 0.782, respectively) than ChatGPT (kappa = 0.560). Although the numerical trends favoured Claude and Manus, pairwise differences in accuracy did not reach statistical significance.
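For readers unfamiliar with the consistency statistic reported above, the following minimal sketch shows how Cohen's kappa can be computed between one model's answers at two time points. The function name `cohens_kappa` and the answer lists are hypothetical and are not the study's data. Whether the authors computed kappa on the selected answer options or on correct/incorrect scores is not stated in the abstract, so the use of answer options here is an assumption.

```python
from collections import Counter


def cohens_kappa(round1, round2):
    """Cohen's kappa for agreement between two sets of categorical answers.

    round1, round2: equal-length sequences holding the answer given to each
    question in the first and second testing round.
    """
    if len(round1) != len(round2) or len(round1) == 0:
        raise ValueError("rounds must be non-empty and of equal length")

    n = len(round1)
    # Observed agreement: share of questions answered identically.
    observed = sum(a == b for a, b in zip(round1, round2)) / n

    # Chance agreement from the marginal answer frequencies of each round.
    freq1, freq2 = Counter(round1), Counter(round2)
    expected = sum(freq1[c] * freq2[c] for c in freq1.keys() | freq2.keys()) / (n * n)

    if expected == 1.0:
        # Both rounds used a single identical category; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical answers to 10 questions in two rounds (not study data).
first_round = list("ABCDABCDAB")
second_round = list("ABCDABCDBA")

kappa = cohens_kappa(first_round, second_round)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so a model can repeat most of its answers and still score well below 1.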

Claude and Manus demonstrated numerically higher diagnostic performance and greater response stability compared with ChatGPT; however, these differences did not reach statistical significance and should therefore be interpreted cautiously. This variability across models highlights the need for larger-scale evaluations. These findings underscore the importance of considering both accuracy and consistency when selecting AI tools for integration into dental practice and curricula.

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12823890/full.md

---
Source: https://tomesphere.com/paper/PMC12823890