AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning
Suzhong Fu, Jingqi Dong, Xuan Ding, Rui Sun, Yiming Yang, Shuguang Cui, Zhen Li

TL;DR
AgentsEval introduces a multi-agent reasoning framework that emulates radiologists' diagnostic workflow to evaluate medical imaging reports more reliably, providing explicit reasoning and clinical feedback for trustworthy AI integration.
Contribution
The paper presents a novel multi-agent stream reasoning framework for evaluating medical reports, capturing diagnostic logic and providing interpretable, clinically aligned assessments.
Findings
Outperforms existing evaluation methods in clinical relevance.
Robust under paraphrastic and semantic perturbations.
Provides explicit reasoning traces and structured feedback.
Abstract
Evaluating the clinical correctness and reasoning fidelity of automatically generated medical imaging reports remains a critical yet unresolved challenge. Existing evaluation methods often fail to capture the structured diagnostic logic that underlies radiological interpretation, resulting in unreliable judgments and limited clinical relevance. We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. We also construct a multi-domain perturbation-based benchmark covering five medical report datasets with diverse imaging modalities and controlled semantic variations. Experimental results…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education · Topic Modeling · Multimodal Machine Learning Applications
