GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI
Duaa Alim, Mogtaba Alim, Liam Chalcroft

TL;DR
GAZE is a framework that lets medical vision-language models analyze brain MRI scans iteratively, using viewer-level tools and literature retrieval, and improves diagnosis and lesion localization for rare conditions without task-specific fine-tuning.
Contribution
Introduces GAZE, a framework that integrates viewer-level tools and literature retrieval for medical VLMs, enhancing performance on rare brain MRI conditions without task-specific fine-tuning.
Findings
GAZE achieves 58.2 mAP at IoU 0.3 for lesion localization on the NOVA benchmark.
GAZE reaches 34.9% Top-1 diagnostic accuracy on the same benchmark.
Tool use disproportionately benefits rare pathologies, increasing localization IoU from 17% to 58%.
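The localization figures above are stated in terms of intersection-over-union (IoU). As a minimal sketch of that quantity, the snippet below computes IoU for two axis-aligned boxes and checks it against a 0.3 threshold. The box layout `(x_min, y_min, x_max, y_max)` and the helper names `box_iou` and `is_hit` are assumptions made for illustration; the paper's own box format and matching procedure are not given in this summary.

```python
from typing import Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixel units.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Box, truth: Box, threshold: float = 0.3) -> bool:
    """A prediction counts as a match when IoU meets the threshold."""
    return box_iou(pred, truth) >= threshold
```

For example, a predicted box covering the lower half of a ground-truth box has IoU 0.5 and passes the 0.3 threshold, while two equal squares overlapping in one quarter have IoU 25/175 (about 0.14) and do not.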
Abstract
Vision-language models (VLMs) read an image and produce text in a single forward pass, whereas radiologists typically inspect an image several times and consult the literature before writing a report. We introduce GAZE (Grounded Agentic Zero-shot Evaluation), a framework that lets a medical VLM work in this iterative way by calling viewer-level tools (zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine (PubMed for medical literature, Open-i for radiological images), with structured outputs validated against a schema and full tool-call traces recorded for auditability. On NOVA, a benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE reaches 58.2 mean average precision (mAP) at intersection-over-union (IoU) 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy under a joint protocol that…
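To make the abstract's description more concrete, here is a minimal sketch of two viewer-level tools (windowing and zoom), a recorded tool-call trace, and a structured-output check. Everything in it is a labeled assumption: the function names, the tool registry, the trace format, and the required fields `diagnosis` and `bbox` are hypothetical and are not taken from GAZE itself. The image is a plain nested list so the example has no dependencies.

```python
import json
from typing import Any, Callable, Dict, List

Image = List[List[float]]  # hypothetical stand-in for one 2D MRI slice


def window(image: Image, center: float, width: float) -> Image:
    """Intensity windowing: map [center - width/2, center + width/2] to [0, 1]."""
    if width <= 0:
        raise ValueError("width must be positive")
    lo = center - width / 2
    return [[min(1.0, max(0.0, (v - lo) / width)) for v in row] for row in image]


def zoom(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Zoom by cropping a rectangular region (no resampling in this sketch)."""
    return [row[left:left + width] for row in image[top:top + height]]


# Hypothetical tool registry; a real system would expose more tools.
TOOLS: Dict[str, Callable[..., Image]] = {"window": window, "zoom": zoom}


def call_tool(trace: List[Dict[str, Any]], image: Image, name: str, **args: Any) -> Image:
    """Run a tool and append the call to the trace so it can be audited later."""
    result = TOOLS[name](image, **args)
    trace.append({"tool": name, "args": args})
    return result


# Assumed output schema: field name -> required Python type.
REQUIRED_FIELDS = {"diagnosis": str, "bbox": list}


def validate_output(raw: str) -> Dict[str, Any]:
    """Parse model output as JSON and reject it if it violates the assumed schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"missing or mistyped field: {field}")
    if len(data["bbox"]) != 4:
        raise ValueError("bbox must have four coordinates")
    return data
```

A caller would chain tool calls on the same trace list, for instance windowing a slice and then zooming into a region, and pass the model's final text through `validate_output` before scoring it. Keeping the trace as plain dictionaries means it can be serialized with `json.dumps` for later review.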
