TL;DR
MedAtlas is a benchmark framework for evaluating large language models on complex, multi-modal, multi-turn medical reasoning tasks that closely mimic real clinical workflows. Its results highlight current performance gaps.
Contribution
We introduce MedAtlas, a novel multi-turn, multi-modal, multi-task benchmark for medical reasoning, and propose new evaluation metrics to assess model performance in realistic clinical scenarios.
Findings
Existing models show significant performance gaps in multi-stage reasoning.
MedAtlas covers diverse imaging modalities and clinical texts, reflecting real-world complexity.
Benchmark results highlight the need for more robust models in medical AI.
Abstract
Artificial intelligence has demonstrated significant potential in clinical decision-making; however, developing models capable of adapting to diverse real-world scenarios and performing complex diagnostic reasoning remains a major challenge. Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks, lacking multi-modal medical image integration and failing to capture the longitudinal and multi-modal interactive nature inherent to clinical practice. To address this gap, we introduce MedAtlas, a novel benchmark framework designed to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. It supports four core tasks: open-ended multi-turn question answering, closed-ended multi-turn…