MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine

Jie Xu; Lu Lu; Sen Yang; Bilin Liang; Xinwei Peng; Jiali Pang; Jinru Ding; Xiaoming Shi; Lingrui Yang; Huan Song; Kang Li; Xin Sun; Shaoting Zhang

arXiv:2305.07340·cs.CL·May 15, 2023·2 cites

Open Access

TL;DR

This paper introduces MedGPTEval, a comprehensive dataset and benchmark designed to evaluate large language models' responses in medical contexts, focusing on professional, social, contextual, and robustness capabilities.

Contribution

The paper develops an evaluation framework whose criteria were refined by experts using a Delphi method, and builds Chinese-language medical datasets for benchmarking LLM-based chatbots in clinical scenarios.

Findings

01

Dr. PJ outperforms ChatGPT and ERNIE Bot in medical dialogue tasks.

02

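The evaluation criteria comprise 16 indicators across four areas: medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness.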

03

Benchmark results demonstrate Dr. PJ's superior performance in medical response quality.

Abstract

METHODS: First, a set of candidate evaluation criteria is designed based on a comprehensive literature review. Second, five experts in medicine and engineering optimize the candidate criteria using a Delphi method. Third, three clinical experts design a set of medical datasets for interacting with LLMs. Finally, benchmarking experiments are conducted on the datasets: the responses generated by LLM-based chatbots are recorded and then blindly evaluated by five licensed medical experts.

RESULTS: The resulting evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with sixteen detailed indicators. The medical datasets include twenty-seven medical dialogues and seven case reports in Chinese. Three chatbots are evaluated: ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor PuJiang (Dr. PJ) by…
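The blind evaluation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' scoring procedure: the 1-5 scale, the anonymized response IDs, the `aggregate_blind_scores` helper, and the placeholder systems `system_A` and `system_B` are all assumptions made for illustration, and the scores are made-up placeholders rather than results from the paper. Only the four category names come from the abstract.

```python
from collections import defaultdict
from statistics import mean

# The four capability categories named in the abstract.
CATEGORIES = (
    "medical professional capabilities",
    "social comprehensive capabilities",
    "contextual capabilities",
    "computational robustness",
)


def aggregate_blind_scores(ratings, key):
    """Average scores per (system, category).

    ratings: iterable of (blind_id, category, rater, score) tuples, where
             blind_id hides which chatbot produced the response.
    key:     dict mapping blind_id -> system name, consulted only after
             all ratings have been collected (hypothetical unblinding step).
    """
    buckets = defaultdict(list)
    for blind_id, category, _rater, score in ratings:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        buckets[(key[blind_id], category)].append(score)
    return {pair: mean(scores) for pair, scores in buckets.items()}


# Placeholder data only: invented scores on an assumed 1-5 scale.
ratings = [
    ("resp_1", "contextual capabilities", "rater_1", 4),
    ("resp_1", "contextual capabilities", "rater_2", 5),
    ("resp_2", "contextual capabilities", "rater_1", 3),
    ("resp_2", "contextual capabilities", "rater_2", 3),
]
key = {"resp_1": "system_A", "resp_2": "system_B"}

summary = aggregate_blind_scores(ratings, key)
for (system, category), avg in sorted(summary.items()):
    print(f"{system} | {category} | mean score {avg}")
```

Keeping the ID-to-system mapping separate from the ratings means raters only ever see anonymized IDs, which is one simple way to keep an evaluation blind.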

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Social Media in Health Education

Methods: ERNIE