Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

Yu He Ke; Liyuan Jin; Kabilan Elangovan; Hairil Rizal Abdullah; Nan Liu; Alex Tiong Heng Sia; Chai Rick Soh; Joshua Yi Min Tung; Jasmine Chiat Ling Ong; Chang-Fu Kuo; Shao-Chun Wu; Vesela P. Kovacheva; Daniel Shu Wei Ting

arXiv:2410.08431 · cs.CL · October 14, 2024

Open Access

TL;DR

This study evaluates retrieval-augmented large language models for medical preoperative assessments, demonstrating high accuracy, speed, and consistency across guidelines, with GPT4 achieving 96.4% correctness and no hallucinations.

Contribution

It introduces LLM-RAG models tailored for medical preoperative tasks, showing their effectiveness and generalizability across diverse clinical guidelines.

Findings

01. GPT4 LLM-RAG achieved 96.4% accuracy in assessments.

02. Models responded within 20 seconds, faster than clinicians.

03. Responses were consistent and hallucination-free.

Abstract

Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions. We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses. A total of 3,682 responses were evaluated. Clinical documents were processed using LlamaIndex, and 10 LLMs, including GPT3.5, GPT4, and Claude-3, were assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects of preoperative instructions. Established guidelines and expert judgment were used to determine correct responses, with human-generated…
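To make the retrieval step of a RAG pipeline concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and does not use the LlamaIndex API: the bag-of-words cosine scoring, the function names (`retrieve`, `build_prompt`), and the placeholder guideline excerpts are all assumptions made for illustration. A real system would typically use learned embeddings and pass the assembled prompt to an LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (toy lexical retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Assemble a prompt that grounds the question in the retrieved excerpts."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the guideline excerpts below.\n"
        f"{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical placeholder text, NOT taken from any real clinical guideline.
GUIDELINE_CHUNKS = [
    "Placeholder excerpt A about fasting before elective surgery.",
    "Placeholder excerpt B about anticoagulant medication before surgery.",
    "Placeholder excerpt C about post-discharge follow-up scheduling.",
]

if __name__ == "__main__":
    print(build_prompt("How long should fasting last before surgery?", GUIDELINE_CHUNKS))
```

The prompt returned by `build_prompt` is what would be sent to whichever LLM is under evaluation; swapping the model while keeping the retriever fixed is one way to compare several models on the same retrieved context.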

Taxonomy

Topics: Artificial Intelligence in Healthcare · Machine Learning in Healthcare · Online Learning and Analytics

Methods: Attention Dropout · Linear Layer · Weight Decay · WordPiece · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Byte Pair Encoding · BERT