# Poster Session II A333 ARTIFICIAL INTELLIGENCE MODELS DEMONSTRATE PROMISING BUT SUBOPTIMAL PERFORMANCE IN DIAGNOSING AND TREATING DISORDERS OF GUT-BRAIN INTERACTION

**Authors:** M Ahn, C Parker

PMC · DOI: 10.1093/jcag/gwaf042.332 · 2026-02-13

## TL;DR

AI models show moderate accuracy in diagnosing gut-brain interaction disorders but have significant limitations in treatment recommendations.

## Contribution

This study evaluates the diagnostic and treatment accuracy of five AI models for disorders of gut-brain interaction using Rome IV standards.

## Key findings

- Perplexity (74%), ChatGPT (72%), and Google Gemini (72%) had the highest diagnostic accuracy, while Microsoft Copilot and OpenEvidence had the lowest at 65% each.
- Treatment recommendations were most accurate for neonate/toddler and gallbladder/sphincter of Oddi disorders and least accurate for gastroduodenal disorders.
- No significant differences were found between AI models in diagnostic or treatment accuracy.

## Abstract

Diagnosing disorders of gut-brain interaction (DGBIs) remains a persistent challenge in gastroenterology and primary care. Large language models (LLMs) have emerged as a potential tool to support clinical decision-making, yet their clinical applicability remains unclear.

This study aimed to compare the diagnostic accuracy and treatment recommendations of five LLMs using clinical scenarios from the Rome IV Multidimensional Clinical Profile (MDCP).

Sixty-eight case scenarios representing various DGBIs were entered into ChatGPT 4.0, Google Gemini 2.5 Pro, Microsoft Copilot, OpenEvidence, and Perplexity. Each model was given a standardized prompt asking for a diagnosis and treatment options based on the case scenario. Responses were evaluated against MDCP standards and expert consensus for diagnostic and treatment accuracy. For diagnostic accuracy, a response was graded correct if it matched the MDCP diagnosis or used an accepted historical term. A response was graded partially correct if it identified a clinical modifier but gave an incorrect MDCP diagnosis, or if it identified only part of the MDCP diagnosis. A response was graded incorrect if it did not match the MDCP diagnosis, identified an incorrect subtype of a correct diagnosis, or included unrelated additional diagnoses.

Diagnostic accuracy was highest with Perplexity (74%), followed by ChatGPT (72%) and Google Gemini (72%), and lowest with Microsoft Copilot (65%) and OpenEvidence (65%). For treatment options, ChatGPT, Google Gemini, and Microsoft Copilot each generated appropriate options in 53% to 54% of cases, compared with 41% each for OpenEvidence and Perplexity. In a subgroup analysis by MDCP diagnostic category, LLMs were most accurate in diagnosing child/adolescent conditions and esophageal disorders and least accurate in diagnosing anorectal disorders. For treatment options, LLMs were most accurate for neonate/toddler disorders and gallbladder/sphincter of Oddi disorders and least accurate for gastroduodenal disorders. Statistical analysis revealed no significant differences between LLMs in diagnostic accuracy (χ²(4) = 4.26, p = 0.372) or treatment accuracy (χ²(4) = 8.069, p = 0.089).
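As a sanity check on the reported statistics, the p-values can be recomputed from the chi-square values and degrees of freedom alone. The sketch below is not the authors' analysis code; the function name `chi2_sf_even_df` is hypothetical, and treating the comparison as five models by two outcomes (which yields 4 degrees of freedom) is an assumption, since the abstract does not say how partially correct responses were counted. It uses the closed-form upper-tail probability that holds for even degrees of freedom, so only the standard library is needed.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    # For even df the survival function is a finite sum:
    # exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    return math.exp(-half) * sum(
        half**k / math.factorial(k) for k in range(df // 2)
    )


# Degrees of freedom for a (models x outcomes) contingency table.
# n_outcomes = 2 is an assumption that happens to match the reported df of 4.
n_models, n_outcomes = 5, 2
df = (n_models - 1) * (n_outcomes - 1)

# Chi-square statistics as reported in the abstract.
p_diag = chi2_sf_even_df(4.26, df)
p_treat = chi2_sf_even_df(8.069, df)

print(f"df={df}, diagnostic p={p_diag:.3f}, treatment p={p_treat:.3f}")
```

Rounded to three decimals, both values agree with the reported p = 0.372 and p = 0.089, so the statistics and p-values in the abstract are internally consistent.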

LLMs demonstrated moderately accurate performance in determining the diagnosis of and treatment options for DGBIs. Diagnostic errors and the omission of several treatment options highlight the risk of premature clinical adoption and the need for further validation of strategies for clinical integration.

Funding agencies: None

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12901613/full.md

---
Source: https://tomesphere.com/paper/PMC12901613