MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology
Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain, Phil F Cheng, Petros Liakopoulos, Olivier Michielin, Michael Moor, Charlotte Bunne

TL;DR
MTBBench is a new benchmark simulating complex, multimodal, and longitudinal decision-making in oncology, revealing current LLMs' limitations and guiding improvements for clinical reasoning tasks.
Contribution
Introduces MTBBench, a realistic, multimodal, and longitudinal benchmark for oncology decision-making, along with an agentic framework that improves LLM reasoning and reliability.
Findings
Current LLMs often hallucinate and struggle with multimodal data.
Tool-enhanced models show performance improvements of up to 11.2%.
MTBBench provides a challenging environment for advancing clinical reasoning in LLMs.
Abstract
Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open- and closed-source LLMs and show…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications
