EH-Benchmark: Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow
Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu

TL;DR
This paper introduces EH-Benchmark, a benchmark for evaluating hallucinations of medical large language models (MLLMs) in ophthalmology, and proposes an agent-driven reasoning framework that reduces hallucinations and improves diagnostic accuracy.
Contribution
It presents EH-Benchmark for evaluating hallucinations in MLLMs and introduces an agent-centric, three-phase framework to mitigate hallucinations and enhance model reliability.
Findings
The multi-agent framework significantly reduces hallucinations.
The framework improves accuracy, interpretability, and reliability.
EH-Benchmark effectively evaluates hallucination types in ophthalmology.
Abstract
Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple…
