TL;DR
ABRA is a radiology-agent benchmark that evaluates AI models in a realistic, navigable medical imaging environment, with 655 tasks across eight task types and automated scoring.
Contribution
This work presents ABRA, an environment for radiology agents with 655 programmatically generated tasks and automatic, task-type-specific scoring, and uses it to evaluate current models' perception and tool-use capabilities.
Findings
Models achieve high Execution accuracy but low Outcome success when no oracle data is provided.
Outcome scores improve significantly when models are given simulated detectors, indicating that perception is a bottleneck.
Because each episode is scored separately along Planning, Execution, and Outcome, ABRA supports a detailed assessment of where radiology agents succeed and fail.
Abstract
Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and…
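To make the function-calling setup concrete, here is a minimal sketch of how a tool call for slice navigation or windowing might be dispatched against viewer state. Everything in it is an assumption for illustration: the names ViewerState, go_to_slice, set_window, and dispatch, the argument shapes, and the default window values are hypothetical and are not taken from ABRA, OHIF, or Orthanc.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ViewerState:
    """Hypothetical stand-in for the viewer state an agent manipulates."""
    num_slices: int
    slice_index: int = 0
    window_center: float = 40.0   # illustrative default, not from the paper
    window_width: float = 400.0   # illustrative default, not from the paper
    history: List[str] = field(default_factory=list)


def go_to_slice(state: ViewerState, index: int) -> Dict[str, Any]:
    """Move to a slice, clamping the index to the valid range."""
    state.slice_index = max(0, min(index, state.num_slices - 1))
    state.history.append("go_to_slice")
    return {"slice_index": state.slice_index}


def set_window(state: ViewerState, center: float, width: float) -> Dict[str, Any]:
    """Set the display window; a non-positive width is rejected."""
    if width <= 0:
        return {"error": "width must be positive"}
    state.window_center, state.window_width = center, width
    state.history.append("set_window")
    return {"window_center": center, "window_width": width}


# Hypothetical tool registry: tool name -> handler.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "go_to_slice": go_to_slice,
    "set_window": set_window,
}


def dispatch(state: ViewerState, call: Dict[str, Any]) -> Dict[str, Any]:
    """Route one function call of the form {"name": ..., "arguments": {...}}."""
    handler = TOOLS.get(call.get("name", ""))
    if handler is None:
        return {"error": f"unknown tool: {call.get('name')}"}
    return handler(state, **call.get("arguments", {}))
```

In a setup like this, the recorded history of calls could feed an execution-style check, while the final viewer state or report could feed an outcome-style check; how ABRA's scorers actually do this is not specified in the text above.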
