Beyond Attention Heatmaps: How to Get Better Explanations for Multiple Instance Learning Models in Histopathology

Mina Jamshidi Idaji; Julius Hense; Tom Neuhäuser; Augustin Krause; Yanqing Luo; Oliver Eberle; Thomas Schnake; Laure Ciernik; Farnoush Rezaei Jafari; Reza Vahidimajd; Jonas Dippel; Christoph Walz; Frederick Klauschen; Andreas Mock; Klaus-Robert Müller

arXiv:2603.08328·cs.CV·March 10, 2026

Open Access

TL;DR

This paper benchmarks explanation methods for multiple instance learning (MIL) models in histopathology. It shows that perturbation, LRP, and IG produce better heatmaps than the widely used attention heatmaps, and that the best methods can provide biologically meaningful insights.

Contribution

Introduces a framework for evaluating MIL heatmaps without additional labels and benchmarks six explanation methods across task types, MIL model architectures, and patch encoder backbones, showing that explanation quality depends mostly on model architecture and task type.

Findings

01. Perturbation ("Single"), layer-wise relevance propagation (LRP), and Integrated Gradients (IG) outperform attention-based heatmaps.

02. Model architecture and task type significantly influence explanation quality.

03. The best explanation methods enable biological validation and discovery in histopathology.

Abstract

Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where large numbers of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet the validity of these heatmaps has barely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance…
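To make the contrast between attention heatmaps and perturbation-style heatmaps concrete, the following is a minimal sketch on a toy attention-pooling MIL model. It is not the authors' implementation. The model, the parameter names (`V`, `w`, `c`), and the reading of "Single" as "evaluate the model on a bag containing only one patch" are all assumptions made for illustration.

```python
import numpy as np

def attention_mil_forward(patches, V, w, c):
    """Toy attention-MIL model (hypothetical, for illustration only).

    patches: (n_patches, d) patch embeddings
    V: (d, h), w: (h,) attention parameters; c: (d,) linear head
    Returns (slide_score, attention_weights).
    """
    logits = np.tanh(patches @ V) @ w            # one attention logit per patch
    logits = logits - logits.max()               # numerical stability for softmax
    attn = np.exp(logits) / np.exp(logits).sum() # non-negative, sums to 1
    bag = attn @ patches                         # attention-weighted bag embedding
    return float(bag @ c), attn

def single_patch_scores(patches, V, w, c):
    """Assumed 'Single'-style perturbation: score each patch by the model
    output when the bag contains only that patch."""
    return np.array(
        [attention_mil_forward(p[None, :], V, w, c)[0] for p in patches]
    )

# Random toy data: 6 patches with 4-dimensional embeddings.
rng = np.random.default_rng(0)
n, d, h = 6, 4, 3
patches = rng.normal(size=(n, d))
V = rng.normal(size=(d, h))
w = rng.normal(size=h)
c = rng.normal(size=d)

slide_score, attn = attention_mil_forward(patches, V, w, c)
single = single_patch_scores(patches, V, w, c)

print("attention heatmap:   ", np.round(attn, 3))
print("single-patch heatmap:", np.round(single, 3))
```

In this toy setting the attention weights are non-negative and sum to one, so they show where the model looks but not whether a patch pushes the prediction up or down. The single-patch scores are signed model outputs, which is one reason a perturbation-style heatmap can carry information that an attention heatmap cannot.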

Taxonomy

Topics: AI in cancer detection · Explainable Artificial Intelligence (XAI) · Radiomics and Machine Learning in Medical Imaging