Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4

Ting Fang Tan; Kabilan Elangovan; Liyuan Jin; Yao Jie; Li Yong; Joshua Lim; Stanley Poh; Wei Yan Ng; Daniel Lim; Yuhe Ke; Nan Liu; Daniel Shu Wei Ting

arXiv:2402.10083·cs.AI·February 16, 2024·6 cites

Open Access

TL;DR

This study evaluates GPT-4's ability to assess ophthalmology chatbot responses, finding strong alignment with clinicians and highlighting its potential to streamline healthcare AI validation.

Contribution

It demonstrates GPT-4's effectiveness in automatically evaluating ophthalmology chatbot responses, aligning well with clinician judgments, and identifying inaccuracies.

Findings

01. GPT-4 evaluation correlates highly with clinician rankings (Spearman 0.90).

02. GPT-4 effectively identifies clinical inaccuracies in LLM responses.

03. Fine-tuned LLMs show varying performance, with GPT-3.5 scoring highest.

Abstract

Purpose: To assess the alignment of GPT-4-based evaluation with human clinician experts in evaluating responses to ophthalmology-related patient queries generated by fine-tuned LLM chatbots. Methods: 400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions, divided into fine-tuning (368; 92%) and testing (40; 8%). We fine-tuned 5 different LLMs, including LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat. For the testing dataset, an additional 8 glaucoma QnA pairs were included. 200 responses to the testing dataset were generated by the 5 fine-tuned LLMs for evaluation. A customized clinical evaluation rubric was used to guide GPT-4 evaluation, grounded in clinical accuracy, relevance, patient safety, and ease of understanding. GPT-4 evaluation was then compared against ranking by 5 clinicians for clinical…
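
The comparison described in the abstract (rubric-guided scores turned into a model ranking, then checked against a clinician ranking with Spearman correlation) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the model labels, the 1 to 5 score scale, summing the four criteria with equal weight, and all numeric values are made up and are not the study's data or the authors' pipeline.

```python
from statistics import mean

# The four rubric criteria named in the abstract.
CRITERIA = ("clinical_accuracy", "relevance", "patient_safety", "ease_of_understanding")

def average_ranks(values, higher_is_better=True):
    """Rank values starting at 1 (best); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(rank_x, rank_y):
    """Spearman correlation, computed as the Pearson correlation of two rank lists."""
    mx, my = mean(rank_x), mean(rank_y)
    cov = sum((a - mx) * (b - my) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mx) ** 2 for a in rank_x)
    var_y = sum((b - my) ** 2 for b in rank_y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical rubric scores (assumed 1-5 scale) for five placeholder models.
rubric_scores = {
    "model_a": (5, 5, 5, 4),
    "model_b": (4, 5, 4, 4),
    "model_c": (4, 4, 4, 3),
    "model_d": (3, 3, 4, 3),
    "model_e": (2, 3, 3, 2),
}

# Assumption: equal-weight sum across the four criteria.
totals = [sum(scores) for scores in rubric_scores.values()]
evaluator_ranks = average_ranks(totals)

# Hypothetical clinician ranking of the same five models (1 = best).
clinician_ranks = [2.0, 1.0, 3.0, 5.0, 4.0]

rho = spearman(evaluator_ranks, clinician_ranks)
print(f"Spearman correlation: {rho:.2f}")
```

Computing Spearman as Pearson on ranks keeps the sketch dependency-free and handles ties through the shared average rank; in practice a library routine such as `scipy.stats.spearmanr` would serve the same purpose.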

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education

Methods: Cosine Annealing · Position-Wise Feed-Forward Layer · Attention Is All You Need · Dropout · Linear Layer · Linear Warmup With Cosine Annealing · Attention Dropout