An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models
Yusheng Liao, Yutong Meng, Hongcheng Liu, Yanfeng Wang, Yu Wang

TL;DR
This paper presents an automated evaluation framework for assessing large language models' capabilities in multi-turn medical consultations, focusing on accuracy, awareness of knowledge gaps, and diagnosis, with a new benchmark based on USMLE questions.
Contribution
The paper introduces a novel evaluation framework and benchmark for LLMs in medical consultations, along with a training set intended to improve their diagnostic performance and reduce hallucinations.
Findings
Fine-tuning improves LLMs' accuracy in medical tasks.
The framework effectively detects hallucinations and knowledge gaps.
The proposed training set improves robustness and consultation skills.
Abstract
Large language models (LLMs) have achieved significant success in interacting with humans. However, recent studies have revealed that these models often suffer from hallucinations, leading to overly confident but incorrect judgments. This limits their application in the medical domain, where tasks require the utmost accuracy. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations. Consultation tasks are designed to require LLMs to be aware of what they do not know, to inquire about missing medical information from patients, and to ultimately make diagnoses. To evaluate the performance of LLMs for these tasks, a benchmark is proposed by reformulating medical multiple-choice questions from the United States Medical Licensing Examinations (USMLE), and comprehensive evaluation metrics are…
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education
Methods: Attentive Walk-Aggregating Graph Neural Network
