An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

Yusheng Liao; Yutong Meng; Hongcheng Liu; Yanfeng Wang; Yu Wang

arXiv:2309.02077 · cs.CL · September 6, 2023 · 2 cites

Open Access

TL;DR

This paper presents an automated evaluation framework for assessing large language models' capabilities in multi-turn medical consultations, focusing on accuracy, awareness of knowledge gaps, and diagnosis, with a new benchmark based on USMLE questions.

Contribution

It introduces a novel evaluation framework and benchmark for LLMs in medical consultations, along with a training set to enhance their diagnostic performance and reduce hallucinations.

Findings

01. Fine-tuning on the proposed training set improves LLMs' accuracy in medical tasks.

02. The framework effectively detects hallucinations and knowledge gaps.

03. The training set enhances robustness and consultation skills.

Abstract

Large language models (LLMs) have achieved significant success in interacting with humans. However, recent studies have revealed that these models often suffer from hallucinations, leading to overly confident but incorrect judgments. This limits their application in the medical domain, where tasks require the utmost accuracy. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations. Consultation tasks are designed to require LLMs to be aware of what they do not know, to inquire about missing medical information from patients, and to ultimately make diagnoses. To evaluate the performance of LLMs for these tasks, a benchmark is proposed by reformulating medical multiple-choice questions from the United States Medical Licensing Examinations (USMLE), and comprehensive evaluation metrics are…

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education

Methods: Attentive Walk-Aggregating Graph Neural Network