# Assessing the Accuracy of Artificial Intelligence Chatbots in the Diagnosis and Management of Meniscal Tears

**Authors:** Jason S DeFrancisis, Peter Richa, Daniel Oar, Luke Henwood, Zachary J Buchman, Gagan Grewal

PMC · DOI: 10.7759/cureus.84124 · Cureus · 2025-05-14

## TL;DR

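This study compares the accuracy of two free AI chatbots, ChatGPT-4o and Gemini 2.0 Flash, in answering common questions about meniscal tears. It finds minimal differences between them, and shows that a broader set of verification sources (UpToDate plus peer-reviewed articles) verifies substantially more of their statements than UpToDate alone.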

## Contribution

The study is one of the first to evaluate AI chatbots specifically for orthopaedic medical information accuracy.

## Key findings

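- ChatGPT-4o and Gemini 2.0 Flash showed no statistically significant difference in the average number of statements per question (18.25 vs. 19.50) or in the percentage of verifiable statements (p>0.05).
- Verifying against both UpToDate and peer-reviewed articles, rather than UpToDate alone, increased the share of verifiable statements from 58.61% to 84.11% (p<0.0001).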
- AI chatbots have clinical limitations and cannot replace orthopaedic surgeons' expertise.

## Abstract

Introduction: Artificial intelligence (AI) chatbots have emerged as readily accessible tools for providing medical information to the public. However, the accuracy of AI chatbot responses, particularly in specialized medical fields such as orthopaedic surgery, remains largely understudied.

Objective: This study aims to evaluate the accuracy of responses from two prominent free AI chatbots when posed with frequent questions about meniscus tears, a common orthopaedic injury.

Methods: The two AI chatbots assessed in this study were ChatGPT-4o and Gemini 2.0 Flash. The analysis focused on the number of statements provided by each chatbot and the percentage of verifiable statements based on UpToDate alone, as well as UpToDate combined with peer-reviewed articles as of March 2025.

Results: The results showed no statistically significant difference in the average number of statements generated per question between the two AI chatbots. ChatGPT-4o provided an average of 18.25 statements per question, while Gemini 2.0 Flash generated 19.50 statements per question (p>0.05). Similarly, there was no significant difference in the percentage of verifiable statements provided by each AI chatbot. ChatGPT-4o achieved 58.22% verifiable statements compared to Gemini 2.0 Flash’s 58.97% when using UpToDate as the sole verification source, and 83.56% versus 84.62%, respectively, when incorporating both UpToDate and peer-reviewed articles as verification sources (p>0.05). However, a statistically significant difference in the percentage of verifiable statements was observed based on the verification source used. UpToDate alone resulted in 58.61% of verifiable statements, while combining UpToDate and peer-reviewed articles increased this percentage to 84.11% (p<0.0001).
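The pooled percentages in the Results can be cross-checked against the per-chatbot figures with a weighted average. The sketch below is only an arithmetic consistency check, not the authors' statistical analysis. It assumes both chatbots answered the same set of questions, so the mean statements per question can stand in for each chatbot's total statement count. The helper name `pooled_percentage` is hypothetical.

```python
def pooled_percentage(percentages, weights):
    """Weighted mean of per-group percentages, weighted by group size."""
    total = sum(weights)
    return sum(p * w for p, w in zip(percentages, weights)) / total


# Mean statements per question from the abstract (ChatGPT-4o, Gemini 2.0 Flash).
# Assumption: both chatbots were asked the same questions, so these means are
# proportional to each chatbot's total number of statements.
weights = [18.25, 19.50]

# Per-chatbot percentages of verifiable statements, as reported in the abstract.
uptodate_only = pooled_percentage([58.22, 58.97], weights)
uptodate_plus_articles = pooled_percentage([83.56, 84.62], weights)

print(f"UpToDate alone: {uptodate_only:.2f}%")
print(f"UpToDate + peer-reviewed articles: {uptodate_plus_articles:.2f}%")
```

Under that assumption, the two values round to 58.61% and 84.11%, which agrees with the pooled figures reported in the abstract.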

Conclusion: Overall, the results of this study suggest that there are minimal differences between free AI chatbots in providing orthopaedic medical information. The results also emphasize the importance of utilizing broader verification sources to enhance the accuracy of AI-generated statements. The study indicates that AI chatbots have clinical limitations in their accuracy and understanding of specific orthopaedic conditions. The authors suggest that although AI chatbots can contribute to orthopaedic care and patient education, they are not capable of replacing the clinical judgment or expertise of orthopaedic surgeons.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12165731/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12165731/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC12165731/full.md

---
Source: https://tomesphere.com/paper/PMC12165731