EH-Benchmark: Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Xiaoyu Pan; Yang Bai; Ke Zou; Yang Zhou; Jun Zhou; Huazhu Fu; Yih-Chung Tham; Yong Liu

arXiv:2507.22929 · cs.CL · October 2, 2025

TL;DR

This paper introduces EH-Benchmark, a benchmark for evaluating hallucinations of medical large language models (MLLMs) in ophthalmology, and proposes an agent-driven reasoning framework that reduces hallucinations and improves diagnostic accuracy.

Contribution

The paper presents EH-Benchmark for evaluating hallucinations in MLLMs and introduces an agent-centric, three-phase framework to mitigate those hallucinations and improve model reliability.

Findings

01. Multi-agent framework significantly reduces hallucinations.
02. Framework improves accuracy, interpretability, and reliability.
03. EH-Benchmark effectively evaluates hallucination types in ophthalmology.

Abstract

Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple…
