Evaluation of ChatGPT-Generated Medical Responses: A Systematic Review and Meta-Analysis

Qiuhong Wei; Zhengxiong Yao; Ying Cui; Bo Wei; Zhezhen Jin; and Ximing Xu

arXiv:2310.08410·stat.ME·March 12, 2024·J. Biomed. Informatics·5 cites

Open Access

TL;DR

This systematic review and meta-analysis assesses ChatGPT's accuracy in medical responses, highlighting its potential in healthcare but also emphasizing the need for standardized evaluation methods and better reporting practices.

Contribution

The paper provides a comprehensive summary and meta-analysis of existing studies on ChatGPT's medical performance, identifying methodological inconsistencies and guiding future research directions.

Findings

01

Across the 17 studies in the meta-analysis, ChatGPT's pooled accuracy on medical queries is 56% (95% CI: 51%-60%).

02

High between-study heterogeneity (I² = 87%) and incomplete methodological reporting limit the reliability of these results.

03

ChatGPT shows promise for healthcare applications despite its current limitations.

Abstract

Large language models such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in medicine and provide direction for future research. We searched ten medical literature databases on June 15, 2023, using the keyword "ChatGPT". A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. The analysis showed that ChatGPT displayed an overall integrated accuracy of 56% (95% CI: 51%-60%, I² = 87%) in addressing medical queries. However, the studies varied in question resource, question-asking process, and evaluation metrics. Moreover, many studies failed to report methodological details, including the version…
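To make the pooled accuracy, confidence interval, and I² statistic quoted in the abstract more concrete, here is a minimal sketch of how such figures can be computed from per-study counts. It assumes a DerSimonian-Laird random-effects model on logit-transformed proportions with a 0.5 continuity correction; the abstract does not say which model or transformation the authors used, so this is an illustration of the general technique, not a reproduction of their analysis. The function name `pool_proportions` and the input counts in the usage lines are hypothetical and are not data from the paper.

```python
import math


def pool_proportions(events, totals):
    """Pool per-study accuracies with a DerSimonian-Laird random-effects model.

    Assumption: proportions are pooled on the logit scale with a 0.5
    continuity correction. `events[i]` is the number of correct answers
    and `totals[i]` the number of questions in study i.
    """
    # Logit accuracy and its approximate within-study variance.
    y, v = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5
        y.append(math.log(a / b))
        v.append(1.0 / a + 1.0 / b)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1

    # I^2: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights, pooled logit, and its standard error.
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "pooled": inv_logit(mu),
        "ci_low": inv_logit(mu - 1.96 * se),
        "ci_high": inv_logit(mu + 1.96 * se),
        "i2": i2,
        "tau2": tau2,
    }


# Hypothetical counts for three studies (NOT taken from the paper).
result = pool_proportions(events=[45, 130, 62], totals=[100, 200, 90])
print(f"pooled accuracy: {result['pooled']:.2f}, I^2: {result['i2']:.0f}%")
```

An I² as high as the 87% reported in the abstract means that most of the observed variation in accuracy reflects real differences between studies rather than sampling error, which is why the pooled 56% should be read alongside the differences in question sources, prompting, and evaluation metrics.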

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling