Deliberative multi-agent large language models improve clinical reasoning in ophthalmology

Ehsan Misaghi; Sean T Berkowitz; Bing Yu Chen; Qingyu Chen; Renaud Duval; Pearse A Keane; Danny A Mammo; Ariel Yuhan Ong; Mertcan Sevgi; Sumit Sharma; Sunil K Srivastava; Yih Chung Tham; Fares Antaki

arXiv:2603.21447 · cs.CY · March 24, 2026

Open Access

TL;DR

In a comparative cross-sectional study of 12 individual LLMs and three four-model deliberative councils on 100 ophthalmology clinical vignettes, councils outperformed pooled individual models on accuracy in every tier (for example, 95.0% vs 90.8% for proprietary flagship models) and reduced harm.

Contribution

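It introduces a multi-agent council framework in which LLMs answer a vignette independently, anonymously rank one another's responses, and pass the results to a designated chair that synthesizes the responses and peer reviews into a final answer, with the aim of improving clinical reasoning and mitigating harm in ophthalmology.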

Findings

01 Councils outperform pooled individual models in accuracy across all three tiers.

02 Harm rates are significantly reduced with councils.

03 Councils produce more complete differentials and management plans.

Abstract

Large language models (LLMs) show potential for ophthalmic clinical reasoning, yet individual models risk introducing harm. We evaluated whether multi-agent LLM deliberative councils improve diagnostic performance and mitigate harm compared to individual LLMs. In a comparative cross-sectional study, we assessed 12 individual LLMs and three multi-agent councils on 100 ophthalmology clinical vignettes. Each council comprised four models assembled by type: proprietary flagship, proprietary fast, and open-source. Models independently answered a vignette, anonymously ranked one another's responses, and a designated chair synthesized all responses and peer reviews into a final answer. Councils consistently outperformed pooled individual models across all three tiers. Accuracy improved for proprietary flagship (95.0% vs 90.8%; risk difference [RD]: 4.25 [95% CI: 0.45, 8.05]), proprietary fast…
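
The three-stage protocol described in the abstract (independent answers, anonymous peer ranking, chair synthesis) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Member`, `run_council`, `make_stub`, and `stub_chair` are hypothetical names introduced here, anonymization by shuffling is an assumption, and the stub chair simply selects the best-ranked response, whereas the abstract describes a chair that synthesizes responses and peer reviews into a final answer.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Member:
    """Hypothetical council member: any pair of callables will do."""
    name: str
    answer: Callable[[str], str]                 # vignette -> response text
    rank: Callable[[str, List[str]], List[int]]  # best-to-worst indices


def run_council(vignette: str, members: List[Member],
                chair: Callable[[str, List[str], List[List[int]]], str],
                seed: int = 0) -> str:
    # Stage 1: every member answers the vignette independently.
    answers = [m.answer(vignette) for m in members]

    # Stage 2: shuffle so rankers cannot tell who wrote what (assumed
    # anonymization scheme), then collect each member's ranking.
    order = list(range(len(answers)))
    random.Random(seed).shuffle(order)
    anonymous = [answers[i] for i in order]
    rankings = [m.rank(vignette, anonymous) for m in members]

    # Stage 3: the chair sees all responses and peer rankings.
    return chair(vignette, anonymous, rankings)


def make_stub(name: str, text: str) -> Member:
    # Toy stand-in for an LLM: fixed answer, ranks longer responses higher.
    return Member(
        name=name,
        answer=lambda vignette: text,
        rank=lambda vignette, answers: sorted(
            range(len(answers)), key=lambda i: -len(answers[i])),
    )


def stub_chair(vignette: str, answers: List[str],
               rankings: List[List[int]]) -> str:
    # Toy chair: return the response with the lowest summed rank position.
    totals = [0] * len(answers)
    for ranking in rankings:
        for position, idx in enumerate(ranking):
            totals[idx] += position
    return answers[min(range(len(answers)), key=totals.__getitem__)]


council = [
    make_stub("model_a", "uveitis"),
    make_stub("model_b", "acute angle-closure glaucoma"),
    make_stub("model_c", "conjunctivitis"),
    make_stub("model_d", "keratitis"),
]
final = run_council("Painful red eye with halos around lights.",
                    council, stub_chair)
print(final)
```

In a real setting each callable would wrap a model API call, and the chair would be prompted with the anonymized responses and rankings; the four-member council size mirrors the study design.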

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Machine Learning in Healthcare