# Comparative Performance of Large Language Models in Ophthalmology Referral Triage

**Authors:** Pedro Cardoso-Teixeira, João Alves Ambrósio, Mariana Garcia, João Chibante-Pedro, Lígia Figueiredo

PMC · DOI: 10.7759/cureus.102060 · 2026-01-22

## TL;DR

This study evaluates how well five large language models (LLMs) classify real-world Portuguese ophthalmology referrals into symptom-based categories, and whether exposure to a limited set of labeled examples improves their accuracy and consistency.

## Contribution

The study presents a novel evaluation of five LLMs on Portuguese-language ophthalmology referral triage, comparing zero-shot classification with classification after supervised in-context learning.

## Key findings

- Baseline accuracy averaged 68.7% across the five models and rose to 73.4% after in-context learning.
- ChatGPT 5.1 reached the highest peak accuracy (79.5%), while ChatGPT 4o showed the largest consistency gain (from 66.8% to 93.8%).
- Performance exceeded 90% for common categories such as diabetic screening and chronic visual loss, but was lower for rare or ambiguous complaints.

## Abstract

Purpose

The aim of this study was to evaluate the accuracy and consistency of five advanced large language model (LLM)-based systems (ChatGPT 4o, ChatGPT 5.1, Perplexity Pro, Claude Sonnet 4.5, and Claude Opus 4.1) in classifying real-world Portuguese ophthalmology referral vignettes into symptom-based categories, and to assess the effect of supervised in-context learning on model performance.

Methods

A total of 3,831 real-world, anonymized ophthalmology referral vignettes written in Portuguese and collected between January and May 2023 were submitted to each system across three independent runs. In phase one, models classified referrals into one of 16 predefined symptom-based categories using a zero-shot prompting strategy. In phase two, each system was exposed to 957 labeled examples (~20% of the dataset) through in-context learning before repeating the task. Classification accuracy, consistency, and Fleiss’ kappa agreement were calculated, with additional analysis by symptom category.
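The difference between the two phases can be sketched as a difference in prompt construction. The abstract does not give the prompt wording or the full category list, so everything below is an assumption for illustration: the function name `build_prompt`, the instruction text, and the example referrals are all hypothetical. The only category names used are those the abstract itself mentions.

```python
def build_prompt(referral_text, categories, examples=()):
    """Build a classification prompt (hypothetical wording, not the study's).

    With no examples this is a zero-shot prompt (phase one); with labeled
    (text, category) pairs it becomes an in-context learning prompt (phase two).
    """
    lines = [
        "Classify the ophthalmology referral into exactly one category.",
        "Categories:",
    ]
    lines += [f"- {category}" for category in categories]
    # Labeled examples are shown before the referral to be classified.
    for text, label in examples:
        lines += ["", f"Referral: {text}", f"Category: {label}"]
    # The referral to classify comes last, with the category left blank.
    lines += ["", f"Referral: {referral_text}", "Category:"]
    return "\n".join(lines)


# Only these two category names appear in the abstract; the study used 16.
CATEGORIES = ["diabetic screening", "chronic visual loss"]

zero_shot = build_prompt("Gradual loss of vision over two years.", CATEGORIES)
few_shot = build_prompt(
    "Gradual loss of vision over two years.",
    CATEGORIES,
    examples=[("Type 2 diabetes, annual retinal check.", "diabetic screening")],
)
```

In this sketch the same function serves both phases, so any change in model output between phases comes from the added examples rather than from a change in instructions.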

Results

Baseline classification accuracy averaged 68.7% across models, improving to 73.4% after exposure to the labeled examples. ChatGPT 5.1 achieved the highest peak accuracy (79.5%), while ChatGPT 4o showed the largest consistency gain (from 66.8% to 93.8%) and a net improvement in 933 cases (p < 0.001). Performance exceeded 90% for common referral categories, such as diabetic screening and chronic visual loss, but was lower for rare or ambiguous complaints. Inter-run agreement, measured by Fleiss’ kappa, ranged from moderate to substantial across models (κ = 0.462 to 0.801), with the highest agreement observed for ChatGPT 4o.
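As a rough illustration of how the three reported metrics relate, the sketch below computes accuracy, consistency, and Fleiss’ kappa for one model from its three runs. It is not the authors' code. It assumes that "consistency" means the share of referrals given the same label in all three runs, which the abstract does not define, and it treats each run as one rater in the standard Fleiss’ kappa formula. The function names and toy labels are hypothetical.

```python
from collections import Counter


def accuracy(predictions, truth):
    """Fraction of referrals whose predicted category matches the reference label."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def consistency(runs):
    """Fraction of referrals labeled identically in every run (assumed definition)."""
    cases = list(zip(*runs))
    return sum(len(set(labels)) == 1 for labels in cases) / len(cases)


def fleiss_kappa(runs):
    """Fleiss' kappa, treating each run as one rater.

    Undefined (division by zero) if every label in every run is the same category.
    """
    cases = list(zip(*runs))
    n_cases, n_raters = len(cases), len(runs)
    totals = Counter()
    agreement_sum = 0.0
    for labels in cases:
        counts = Counter(labels)
        totals.update(counts)
        # Per-case agreement: proportion of rater pairs that agree.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    observed = agreement_sum / n_cases
    # Chance agreement from the overall category proportions.
    expected = sum((c / (n_cases * n_raters)) ** 2 for c in totals.values())
    return (observed - expected) / (1 - expected)


# Toy data: four referrals, three runs, categories "a", "b", "c".
truth = ["a", "b", "a", "c"]
runs = [
    ["a", "b", "a", "c"],
    ["a", "b", "b", "c"],
    ["a", "b", "a", "a"],
]
print([accuracy(run, truth) for run in runs])  # [1.0, 0.75, 0.75]
print(consistency(runs))                       # 0.5
print(round(fleiss_kappa(runs), 4))            # 0.4545
```

Consistency and kappa both measure run-to-run stability, but kappa corrects for the agreement expected by chance given how often each category is used, so the two can diverge when a few categories dominate.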

Conclusions

Advanced LLMs demonstrated strong performance in interpreting real-world Portuguese-language ophthalmology referrals, with meaningful gains in accuracy and consistency achieved through limited supervised in-context exposure. Performance was lower for rare or ambiguous referral categories. Despite this limitation, these findings support the potential role of LLMs as scalable, low-cost triage aids, provided that human oversight and further clinical validation are ensured prior to deployment.

## Full-text entities

- **Diseases:** diabetic (MESH:D003920), macular disease (MESH:D008268), Headaches (MESH:D006261), Metamorphopsia (MESH:D014786), vitreous opacities (MESH:D003318), pterygium (MESH:D011625), strabismus (MESH:D013285), Diabetic retinopathy (MESH:D003930), ptosis (MESH:C564553), scotomas (MESH:D012607), chalazion (MESH:D017043), hallucinations (MESH:D006212), Diplopia (MESH:D004172), Epiphora (MESH:D007766), eyelid masses (MESH:D005141), cataracts (MESH:D002386), glaucoma (MESH:D005901), retinal condition (MESH:D012164)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12924494/full.md

---
Source: https://tomesphere.com/paper/PMC12924494