MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Kiril Vasilev; Alexandre Misrahi; Eeshaan Jain; Phil F Cheng; Petros Liakopoulos; Olivier Michielin; Michael Moor; Charlotte Bunne

arXiv:2511.20490 · cs.LG · November 26, 2025

TL;DR

MTBBench is a new benchmark simulating complex, multimodal, and longitudinal decision-making in oncology, revealing current LLMs' limitations and guiding improvements for clinical reasoning tasks.

Contribution

Introduces MTBBench, a realistic, multimodal, and longitudinal benchmark for oncology decision-making, along with an agentic framework that improves LLM reasoning and reliability.

Findings

01 Current LLMs often hallucinate and struggle with multimodal data.

02 Tool-enhanced models show performance improvements of up to 11.2%.

03 MTBBench provides a challenging environment for advancing clinical reasoning in LLMs.

Abstract

Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show…

Videos

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology · SlidesLive

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications