Is ChatGPT-5 Ready for Mammogram VQA?
Qiang Li, Shansong Wang, Mingzhe Hu, Mojtaba Safari, Zachary Eidex, Xiaofeng Yang

TL;DR
This study evaluates the GPT-5 family and GPT-4o on mammogram visual question answering (VQA) across four public datasets. GPT-5 outperforms the other GPT variants tested but falls short of both human experts and domain-specific fine-tuned models, which points to the need for domain-specific tuning.
Contribution
First comprehensive assessment of GPT-5 on mammography VQA tasks, demonstrating its relative strengths and limitations compared to domain-specific models and human experts.
Findings
GPT-5 outperforms the other GPT models tested but lags behind both human experts and domain-specific fine-tuned models.
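Performance varies widely across datasets and tasks: among the GPT-5 figures reported in the abstract for EMBED, InBreast, and CMMD, accuracy ranges from 32.3% (CMMD abnormality detection) to 64.5% (EMBED mass classification).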
Significant performance improvements from GPT-4o to GPT-5 indicate potential for future development.
Abstract
Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has the potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved…
