Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4
Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting

TL;DR
This study evaluates GPT-4's ability to assess ophthalmology chatbot responses, finding strong alignment with clinicians and highlighting its potential to streamline healthcare AI validation.
Contribution
It demonstrates GPT-4's effectiveness in automatically evaluating ophthalmology chatbot responses, aligning well with clinician judgments, and identifying inaccuracies.
Findings
GPT-4 evaluation correlates highly with clinician rankings (Spearman 0.90).
GPT-4 effectively identifies clinical inaccuracies in LLM responses.
Fine-tuned LLMs show varying performance, with GPT-3.5 scoring highest.
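For readers unfamiliar with the Spearman statistic quoted above, here is a minimal sketch of how a rank correlation between two sets of rankings can be computed. The rankings in the example are invented for illustration and are not the study's data; the function names `average_ranks` and `spearman` are likewise hypothetical helpers, not anything taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical rankings of five chatbots (1 = best). Illustrative only.
clinician_rank = [1, 2, 3, 4, 5]
gpt4_rank = [1, 3, 2, 4, 5]

rho = spearman(clinician_rank, gpt4_rank)
print(f"Spearman rho = {rho:.2f}")
```

With only five ranked items, a single swap of two adjacent positions already lowers rho from 1.0 to 0.9, so the coefficient is coarse at this sample size.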
Abstract
Purpose: To assess the alignment of GPT-4-based evaluation with human clinician experts in evaluating responses to ophthalmology-related patient queries generated by fine-tuned LLM chatbots. Methods: 400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions, divided into fine-tuning (368; 92%) and testing (40; 8%). We fine-tuned 5 different LLMs, including LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat. For the testing dataset, an additional 8 glaucoma QnA pairs were included. 200 responses to the testing dataset were generated by the 5 fine-tuned LLMs for evaluation. A customized clinical evaluation rubric, grounded in clinical accuracy, relevance, patient safety, and ease of understanding, was used to guide GPT-4 evaluation. GPT-4 evaluation was then compared against ranking by 5 clinicians for clinical…
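A rough sketch of how a rubric-guided evaluation like the one described in the abstract might be organized is shown below. Only the four criteria come from the abstract. The prompt wording, the 1 to 5 scale, the equal weighting of criteria, and the names `build_rubric_prompt` and `rank_models` are all assumptions made for illustration. No model API is called, and the scores are made up.

```python
from statistics import mean

# The four rubric criteria named in the abstract.
CRITERIA = (
    "clinical accuracy",
    "relevance",
    "patient safety",
    "ease of understanding",
)


def build_rubric_prompt(question, reference_answer, response):
    """Assemble an evaluation prompt (wording and 1-5 scale are assumptions)."""
    lines = [
        "You are evaluating a chatbot answer to an ophthalmology patient question.",
        "Score the answer from 1 (poor) to 5 (excellent) on each criterion:",
    ]
    lines += [f"- {criterion}" for criterion in CRITERIA]
    lines += [
        "",
        f"Question: {question}",
        f"Reference answer: {reference_answer}",
        f"Chatbot answer: {response}",
    ]
    return "\n".join(lines)


def rank_models(scores_by_model):
    """Rank models by mean total rubric score, best first.

    scores_by_model maps a model name to a list of per-response dicts,
    each mapping a criterion to its score. Equal weighting is an assumption.
    """
    totals = {
        model: mean(sum(row[c] for c in CRITERIA) for row in rows)
        for model, rows in scores_by_model.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Made-up scores for two hypothetical models, two responses each.
example_scores = {
    "model_a": [dict.fromkeys(CRITERIA, 4), dict.fromkeys(CRITERIA, 5)],
    "model_b": [dict.fromkeys(CRITERIA, 3), dict.fromkeys(CRITERIA, 4)],
}

prompt = build_rubric_prompt(
    "What is glaucoma?",
    "(reference answer written by an ophthalmologist)",
    "(chatbot response under evaluation)",
)
ranking = rank_models(example_scores)
print(ranking)
```

A ranking produced this way could then be compared against clinician rankings with a rank correlation, which is the kind of comparison the abstract describes.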
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education
Methods: Cosine Annealing · Position-Wise Feed-Forward Layer · Attention Is All You Need · Dropout · Linear Layer · Linear Warmup With Cosine Annealing · Attention Dropout
