Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

Aidan Gilson; Xuguang Ai; Thilaka Arunachalam; Ziyou Chen; Ki Xiong Cheong; Amisha Dave; Cameron Duic; Mercy Kibe; Annette Kaminaka; Minali Prasad; Fares Siddig; Maxwell Singer; Wendy Wong; Qiao Jin; Tiarnan D.L. Keenan; Xia Hu; Emily Y. Chew; Zhiyong Lu; Hua Xu; Ron A. Adelman; Yih-Chung Tham; Qingyu Chen

arXiv:2409.13902 · cs.CL · September 24, 2024 · 6 cites

Open Access

TL;DR

This study demonstrates that retrieval-augmented generation significantly improves the factual accuracy and evidence attribution of large language models in ophthalmology-based consumer health question answering, addressing hallucination issues.

Contribution

It introduces a domain-specific RAG pipeline with 70,000 ophthalmology documents and systematically evaluates its impact on LLM responses in medical question answering.
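The retrieve-then-augment step behind such a pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not say which retriever or prompt format the paper uses, so a plain TF-IDF cosine ranking is swapped in here, and the three-sentence corpus, the function names (`build_index`, `retrieve`, `build_prompt`), and the prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

# Toy stand-in corpus (hypothetical; the paper's corpus has about 70,000 documents).
DOCS = [
    "Glaucoma damages the optic nerve and is often linked to raised eye pressure.",
    "Cataract surgery replaces the cloudy lens with an artificial intraocular lens.",
    "Dry eye can be managed with artificial tears and warm compresses.",
]

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Return (idf, vectors): smoothed IDF weights and one TF-IDF dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in c.items()} for c in counts]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, idf, vectors, k=2):
    """Return indices of the k documents most similar to the query."""
    q = Counter(tokenize(query))
    qv = {t: tf * idf.get(t, 0.0) for t, tf in q.items()}  # unseen terms get weight 0
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(qv, vectors[i]), reverse=True)
    return ranked[:k]

def build_prompt(question, docs, doc_ids):
    """Prepend the retrieved documents as numbered sources (assumed prompt format)."""
    sources = "\n".join(f"[{rank}] {docs[i]}" for rank, i in enumerate(doc_ids, start=1))
    return (
        "Answer the question using only the numbered sources and cite them.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

IDF, VECTORS = build_index(DOCS)
QUESTION = "What happens during cataract surgery?"
TOP_IDS = retrieve(QUESTION, IDF, VECTORS, k=2)
PROMPT = build_prompt(QUESTION, DOCS, TOP_IDS)
```

The resulting `PROMPT` string is what would be passed to the LLM at inference time. Numbering the sources in the prompt is one simple way to make evidence attribution checkable afterwards, since each cited reference can be traced back to a retrieved document.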

Findings

01 RAG reduces hallucinated and erroneous evidence in LLM responses

02 RAG improves evidence attribution and answer accuracy

03 Top retrieved documents are frequently used as references in responses

Abstract

Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses that lack supporting evidence or rest on hallucinated evidence. While Retrieval Augment Generation (RAG) is a popular way to address this issue, few studies have implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieves relevant documents to augment LLMs at inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses of LLMs with and without RAG, including over 500 references, on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of which, 45.3%…

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Health Literacy and Information Accessibility

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Multi-Head Attention · Linear Warmup With Linear Decay · Weight Decay · Adam · WordPiece