BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

Sahana Srinivasan; Xuguang Ai; Thaddaeus Wai Soon Lo; Aidan Gilson; Minjie Zou; Ke Zou; Hyunjae Kim; Mingjia Yang; Krithi Pushpanathan; Samantha Yew; Wan Ting Loke; Jocelyn Goh; Yibing Chen; Yiming Kong; Emily Yuelei Fu; Michelle Ongyong Hui; Kristen Nwanyanwu; Amisha Dave; Kelvin Zhenghao Li; Chen-Hsin Sun; Mark Chia; Gabriel Dawei Yang; Wendy Meihua Wong; David Ziyou Chen; Dianbo Liu; Maxwell Singer; Fares Antaki; Lucian V Del Priore; Jost Jonas; Ron Adelman; Qingyu Chen; Yih-Chung Tham

arXiv:2507.15717·cs.CL·July 22, 2025

TL;DR

BELO is a comprehensive, expert-validated benchmark for evaluating large language models' ophthalmological knowledge and reasoning, promoting fair comparison and progress in medical AI.

Contribution

This paper introduces BELO, a standardized ophthalmology benchmark with expert-reviewed questions, enabling consistent evaluation of LLMs in clinical accuracy and reasoning.

Findings

01

GPT-4o achieved the highest accuracy among the evaluated models.

02

BELO's questions cover diverse ophthalmology topics.

03

Human review confirmed BELO's questions are high-quality and relevant.

Abstract

Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking. Duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ's correct answer. These explanations were further adjudicated by three senior ophthalmologists. To illustrate BELO's utility,…
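The keyword-matching and duplicate-removal steps mentioned in the abstract can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the keyword list, the function names, and the normalisation rule are hypothetical and are not taken from the paper, and the fine-tuned PubMedBERT classifier and the expert review rounds are not modelled at all.

```python
import re

# Hypothetical keyword list; the paper's actual keywords are not given here.
OPHTHO_KEYWORDS = ("retina", "glaucoma", "cataract", "cornea", "macula", "uveitis")

# Match keywords at the start of a word so that "retinal" or "corneal" also hit.
_KEYWORD_RE = re.compile(r"\b(" + "|".join(OPHTHO_KEYWORDS) + r")", re.IGNORECASE)


def is_ophthalmology(question: str) -> bool:
    """Return True if the question mentions any of the assumed keywords."""
    return _KEYWORD_RE.search(question) is not None


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace for duplicate detection."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def curate(questions):
    """Keep keyword-matched questions, dropping duplicates after normalisation."""
    seen = set()
    kept = []
    for q in questions:
        if not is_ophthalmology(q):
            continue  # not ophthalmology-related under the keyword rule
        key = normalize(q)
        if key in seen:
            continue  # duplicate of a question already kept
        seen.add(key)
        kept.append(q)
    return kept


sample = [
    "Which layer of the retina contains photoreceptors?",
    "Which layer of the RETINA contains photoreceptors ?",
    "What is the first-line therapy for hypertension?",
    "Primary open-angle glaucoma is associated with which finding?",
]
curated = curate(sample)
```

In this sketch the second sample question is dropped as a duplicate of the first and the hypertension question is dropped by the keyword filter. A pure keyword rule will miss questions that describe eye conditions without naming them, which is presumably why the authors pair it with a classifier and expert checking.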
