GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI

Duaa Alim; Mogtaba Alim; Liam Chalcroft

arXiv:2605.00876·cs.LG·May 5, 2026

TL;DR

GAZE is a framework that lets medical vision-language models analyze brain MRI scans iteratively, using viewer tools and literature retrieval, to improve diagnosis and lesion localization for rare conditions.

Contribution

Introduces GAZE, a framework that integrates viewer-level tools and literature retrieval for medical VLMs, enhancing performance on rare brain MRI conditions without task-specific fine-tuning.

Findings

01

GAZE achieves 58.2 mAP at IoU 0.3 for lesion localization on the NOVA benchmark.

02

GAZE reaches 34.9% Top-1 diagnostic accuracy on NOVA.

03

Tool use disproportionately benefits rare pathologies, increasing localization IoU from 17% to 58%.
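
The localization figures above are expressed in terms of intersection-over-union (IoU). As a rough illustration of how that quantity is computed, the following is a minimal sketch assuming axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form. The box format and the helper names `box_iou` and `is_hit` are assumptions made for this example, not taken from the paper; only the 0.3 threshold comes from the abstract.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    The box format is an assumption for this sketch.
    """
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred, truth, threshold=0.3):
    """True when a predicted box matches the ground truth at the given IoU threshold."""
    return box_iou(pred, truth) >= threshold
```

Under this definition, a prediction covering the top half of a ground-truth box has an IoU of 0.5 and would count as a hit at the 0.3 threshold, while two equal squares that overlap in only one quarter (IoU of about 0.14) would not.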

Abstract

Vision-language models (VLMs) read an image and produce text in a single forward pass, whereas radiologists typically inspect an image several times and consult the literature before writing a report. We introduce GAZE (Grounded Agentic Zero-shot Evaluation), a framework that lets a medical VLM work in this iterative way by calling viewer-level tools (zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine (PubMed for medical literature, Open-i for radiological images), with structured outputs validated against a schema and full tool-call traces recorded for auditability. On NOVA, a benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE reaches 58.2 mean average precision (mAP) at intersection-over-union (IoU) 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy under a joint protocol that…
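
To make the tool-calling idea concrete, here is a minimal sketch of two viewer-level tools (windowing and zoom), a wrapper that records each call in a trace, and a check of the model's structured output. Everything in it is an assumption for illustration: the names `window`, `zoom`, `TracedViewer`, and `validate_output`, the representation of an image as a nested list, and the `diagnosis` and `bbox` fields are hypothetical, since the abstract does not specify the actual schema or tool signatures. Contrast, edge detection, and the PubMed and Open-i retrieval tools are omitted.

```python
import json
from dataclasses import dataclass, field


def window(image, center, width):
    """Intensity windowing: clip to [center - width/2, center + width/2], rescale to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return [[(min(max(px, lo), hi) - lo) / (hi - lo) for px in row] for row in image]


def zoom(image, top, left, height, width):
    """Zoom modeled as a plain crop of the requested region (no resampling)."""
    return [row[left:left + width] for row in image[top:top + height]]


# Hypothetical tool registry; the real framework exposes more tools.
TOOLS = {"window": window, "zoom": zoom}


@dataclass
class TracedViewer:
    """Runs tools on one image and records every call for later audit."""
    image: list
    trace: list = field(default_factory=list)

    def call(self, name, **kwargs):
        result = TOOLS[name](self.image, **kwargs)
        self.trace.append({"tool": name, "args": kwargs})
        return result


# Hypothetical output schema: field names are assumptions, not from the paper.
REQUIRED_FIELDS = {"diagnosis": str, "bbox": list}


def validate_output(raw_json):
    """Parse the model's JSON answer and reject it if a required field is missing or mistyped."""
    data = json.loads(raw_json)
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"field {key!r} missing or not {expected_type.__name__}")
    if len(data["bbox"]) != 4:
        raise ValueError("bbox must have four coordinates")
    return data
```

A short usage example under the same assumptions:

```python
viewer = TracedViewer([[0, 50, 100], [150, 200, 250]])
viewer.call("window", center=100, width=100)
viewer.call("zoom", top=0, left=1, height=1, width=2)
answer = validate_output('{"diagnosis": "example", "bbox": [1, 2, 3, 4]}')
print(json.dumps(viewer.trace))
```

Keeping the trace as plain dictionaries means it can be serialized alongside the validated answer, which is one simple way to support the kind of auditability the abstract describes.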
