# Comparative assessment of large language models in diabetic foot infection management: alignment with IWGDF/IDSA guidelines

**Authors:** Hongxia Wu, Jiayi Deng, Xu Qiu, Li Xu, Lumeng Lu, Mingna Fan, Danni Yu, Chuanbo Liu, Zhaohuan Chen, Kai Wang, Yuyan Wang, Haifang Zhou, Liyang Chang, Hanbin Wang

PMC · DOI: 10.3389/fendo.2026.1667159 · Frontiers in Endocrinology · 2026-02-24

## TL;DR

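This study evaluates how well four AI models (ChatGPT-4o, DeepSeek-R1, Grok-3, and Claude-3.7) align with IWGDF/IDSA guidelines for managing diabetic foot infections. It finds comparable accuracy across models and notes that clinical trials are still needed to establish their real-world value.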

## Contribution

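The study introduces a comparative evaluation of four AI models against DFI guidelines, scoring responses on four clinical dimensions (Accuracy, Overconclusiveness, Supplementary Value, and Completeness) and two readability metrics (FRE and FKGL).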

## Key findings

- Grok-3 outperformed other models in supplementary value and completeness dimensions.
- DeepSeek-R1 generated the most complex text based on readability metrics.
- All models showed comparable accuracy and overconclusiveness.

## Abstract

This study aimed to assess the clinical utility of artificial intelligence (AI) models (ChatGPT-4o, DeepSeek-R1, Grok-3, and Claude-3.7) in aligning with international guidelines for diabetic foot infection (DFI) management.

AI systems have demonstrated potential value in numerous fields, but their specific effects in medicine and healthcare still require in-depth exploration. DFI is a relatively common and serious complication among patients with diabetes, so the accurate communication of relevant information is of great significance. It is therefore important to evaluate whether AI can serve as an effective clinical support tool.

Responses from ChatGPT-4o, DeepSeek-R1, Grok-3, and Claude-3.7 were evaluated against DFI guidelines on four clinical dimensions (Accuracy, Overconclusiveness, Supplementary Value, and Completeness), each rated on a 5-point Likert scale. Readability was assessed using Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). Statistical analyses included ANOVA and post hoc comparisons.
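For readers unfamiliar with the two readability metrics, the sketch below computes FRE and FKGL from the standard published formulas. It is illustrative only: the abstract does not say which tool the authors used, and the helper names (`count_syllables`, `text_stats`) and the crude vowel-group syllable heuristic are assumptions made here, so scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption, not the authors' method):
    count runs of consecutive vowels, dropping a trailing 'e'."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a final 'e' as silent when the word has other vowel groups (e.g. "care").
    if lowered.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentence count, word count, syllable count) for the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    print(round(flesch_reading_ease(sample), 2))
    print(round(flesch_kincaid_grade(sample), 2))
```

A lower FRE and a higher FKGL both indicate text that is harder to read, which is the sense in which a model's responses are described as "more complex".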

No significant differences were found across models for Accuracy and Overconclusiveness (p > 0.05). However, Supplementary Value differed significantly (p < 0.001): Grok-3 outperformed ChatGPT-4o (p < 0.0001), DeepSeek-R1 (p = 0.003), and Claude-3.7 (p < 0.0001). Completeness also differed significantly (p = 0.005), with Grok-3 outperforming ChatGPT-4o (p = 0.016) and Claude-3.7 (p = 0.010). Readability also varied: DeepSeek-R1 responses were more complex than those of ChatGPT-4o (p = 0.046).

All models performed comparably in terms of accuracy and in avoiding overconclusive statements. Grok-3 outperformed the other models in the dimensions of Supplementary Value and Completeness. DeepSeek-R1 generated the most complex text. These findings validate the feasibility of AI in the standardized management of DFI, but the models still need to be further verified through clinical trials to determine their value in real-world decision-making.

## Full-text entities

- **Diseases:** DFI (MESH:D017719), diabetic (MESH:D003920)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12971450/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12971450/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/PMC12971450/full.md

---
Source: https://tomesphere.com/paper/PMC12971450