NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

Mohammad H. Abbasi; Favour Nerrise; Shaurnav Ghosh; Ridvan Yesiloglu; Yuncong Mao; Bailey Trang; Mohammad Asadi; Merryn Daniel; Gustavo Chau Loo Kung; Ken Chang; Pavan Pinkesh Shah; Adam Turnbull; Kyan Younes; Seena Dehkharghani; Ehsan Adeli (Stanford University)

arXiv:2605.20525 · cs.CV · May 21, 2026

TL;DR

NeuroQA is a benchmark for visual question answering on 3D brain MRI: 56,953 QA pairs from 12,977 subjects across 12 datasets, testing 11 clinically grounded reasoning skills in Yes/No, multiple-choice, and open-ended formats.

Contribution

NeuroQA is presented as the first extensive VQA benchmark for 3D brain MRI, with expert-reviewed QA pairs, Yes/No, multiple-choice, and open-ended question formats, and a focus on clinically grounded reasoning skills.

Findings

01

Models achieve below 50% accuracy, indicating room for improvement.

02

Answer-distribution refinement reduces closed-format text-only shortcut accuracy from over 80% to 44.6%.

03

Expert review ensures high-quality, contradiction-free QA pairs.

Abstract

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from over 80% to 44.6%;…
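
To make the text-only shortcut concrete, here is a minimal sketch of how such a shortcut can be measured and reduced on toy data. It is not the paper's procedure: this summary does not describe how NeuroQA's answer-distribution refinement works, so the per-template downsampling below is a stand-in assumption, and the names `text_only_accuracy`, `balance_answers`, and the template IDs are hypothetical.

```python
from collections import Counter, defaultdict


def text_only_accuracy(items):
    """Accuracy of a blind baseline that never sees the image.

    For each question template it always predicts that template's most
    common answer. `items` is a list of (template_id, answer) pairs.
    The majority answer is estimated on the same items it is scored on,
    so this is an optimistic bound for a prior-only predictor.
    """
    by_template = defaultdict(list)
    for template_id, answer in items:
        by_template[template_id].append(answer)
    correct = sum(Counter(a).most_common(1)[0][1] for a in by_template.values())
    return correct / len(items)


def balance_answers(items):
    """Hypothetical refinement: downsample each template so that every
    answer seen for that template occurs equally often."""
    groups = defaultdict(lambda: defaultdict(list))
    for template_id, answer in items:
        groups[template_id][answer].append((template_id, answer))
    balanced = []
    for answer_groups in groups.values():
        keep = min(len(g) for g in answer_groups.values())
        for g in answer_groups.values():
            balanced.extend(g[:keep])
    return balanced


# Toy Yes/No data with a skewed answer distribution per template.
skewed = (
    [("t1", "yes")] * 9 + [("t1", "no")] * 1
    + [("t2", "no")] * 8 + [("t2", "yes")] * 2
)

before = text_only_accuracy(skewed)
after = text_only_accuracy(balance_answers(skewed))
print(f"text-only accuracy before: {before:.2f}, after: {after:.2f}")
```

On this toy set the blind baseline scores well above chance before balancing and falls to chance afterwards. That is the kind of effect the reported drop from over 80% to 44.6% describes, though the actual method and data may differ substantially.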
