TL;DR
MI-CXR is a new benchmark designed to evaluate models' ability to perform longitudinal reasoning over multi-visit chest X-ray sequences, highlighting current models' limitations in temporal understanding.
Contribution
The paper introduces MI-CXR, a benchmark for multi-interval longitudinal reasoning over multi-visit chest X-ray sequences, and uses it to evaluate 14 state-of-the-art vision-language models (VLMs).
Findings
Across the 14 evaluated VLMs, average accuracy is only 29.3%, modestly above the 20% expected from random guessing on five-way multiple-choice questions.
Models produce plausible local descriptions but struggle with temporal constraints.
MI-CXR reveals key limitations of current vision-language models in medical longitudinal reasoning.
Abstract
Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using…
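To make the setup in the abstract concrete, the following is a minimal sketch of how one benchmark item and the chance-level baseline could be represented. The `Item` schema, its field names, and the helper functions are hypothetical illustrations, not the benchmark's actual data format or evaluation code. Only the five-visit timelines, the five answer options, the three task family names, and the 29.3% average accuracy come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum

NUM_VISITS = 5   # five-visit patient timelines (stated in the abstract)
NUM_OPTIONS = 5  # five-way multiple choice (stated in the abstract)


class TaskFamily(Enum):
    # The three task families named in the abstract.
    TEMPORAL_EVENT_LOCALIZATION = "temporal_event_localization"
    INTERVAL_WISE_CHANGE_REASONING = "interval_wise_change_reasoning"
    GLOBAL_TRAJECTORY_SUMMARIZATION = "global_trajectory_summarization"


@dataclass(frozen=True)
class Item:
    # Hypothetical schema: the real MI-CXR file format is not described here.
    task: TaskFamily
    visit_image_paths: tuple  # one CXR path per visit, in chronological order
    question: str
    options: tuple            # the five candidate answers
    answer_index: int         # index of the correct option

    def __post_init__(self):
        if len(self.visit_image_paths) != NUM_VISITS:
            raise ValueError(f"expected {NUM_VISITS} visits")
        if len(self.options) != NUM_OPTIONS:
            raise ValueError(f"expected {NUM_OPTIONS} options")
        if not 0 <= self.answer_index < NUM_OPTIONS:
            raise ValueError("answer_index out of range")


def chance_accuracy(num_options=NUM_OPTIONS):
    # Expected accuracy of uniform random guessing with one correct option.
    return 1.0 / num_options


def accuracy(items, predictions):
    # Fraction of items whose predicted option index matches the answer.
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == item.answer_index for item, p in zip(items, predictions))
    return correct / len(items)


# Margin of the reported 29.3% average over the 20% chance level.
margin_over_chance = 0.293 - chance_accuracy()
```

Under these assumptions, the reported average sits about 9.3 percentage points above uniform random guessing, which is the gap the abstract describes as modest.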
