Evaluating multiple large language models on orbital diseases

Qi-Chen Yang; Yan-Mei Zeng; Hong Wei; Cheng Chen; Qian Ling; Xiao-Yu Wang; Xu Chen; Yi Shao

PMC · DOI: 10.3389/fcell.2025.1574378 · July 7, 2025

Open Access

TL;DR

This study evaluates how well large language models like GPT-4 perform in answering questions about orbital diseases compared to medical students and ophthalmologists.

Contribution

The study introduces a dataset of orbital disease questions and compares LLMs' performance against human experts in ophthalmology.

Findings

01

GPT-4 and PaLM2 showed the highest average correlation with correct answers.

02

GPT-4 outperformed medical students but did not match ophthalmologists' accuracy.

03

LLMs like GPT-4 have potential as educational tools in ophthalmology with further refinement.

Abstract

Humans avoid mistakes through continuous learning, error correction, and accumulated experience. This process is time-consuming and laborious, and it often involves numerous detours. To assist human learning, ChatGPT, built on Generative Pre-trained Transformer (GPT) large language models (LLMs), was developed to generate human-like responses to a wide range of problems. In this study, we assessed the potential of LLMs as assistants for answering questions about orbital diseases. We gathered a dataset of 100 orbital disease questions with their corresponding answers, sourced from examinations administered to ophthalmology residents and medical students. Five LLMs were tested and compared, namely,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes (1)

PaLM2

Proteins (1)

Species (1)

Homo sapiens (human · species)

Chemicals (1)

GPT-4

Diseases (1)

orbital diseases

Figures (5)

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · COVID-19 diagnosis using AI · Retinal Imaging and Analysis