Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language

Michelle Stegeman; Lena Philipp; Fennie van der Graaf; Marina D'Amato; Clément Grisi; Luc Builtjes; Joeran S. Bosma; Judith Lefkes; Rianne A. Weber; James A. Meakin; Thomas Koopman; Anne Mickan; Mathias Prokop; Ewoud J. Smit; Geert Litjens; Jeroen van der Laak; Bram van Ginneken; Maarten de Rooij; Henkjan Huisman; Colin Jacobs; Francesco Ciompi; Alessa Hering (and on behalf of the UNICORN consortium)

arXiv:2603.02790 · cs.CV · March 4, 2026

Open Access

TL;DR

UNICORN is a comprehensive, standardized benchmark for evaluating medical foundation models across multiple modalities, tasks, and domains, enabling reproducible and comparable assessments of their generalization capabilities.

Contribution

The paper introduces a unified evaluation framework, a novel scoring metric, and a large, diverse dataset for benchmarking medical foundation models across medical imaging and language tasks.

Findings

01 Benchmark includes data from over 2,400 patients across 8 countries.

02 Performance is summarized with a new UNICORN Score for cross-domain comparison.

03 Provides publicly available data, methods, and evaluation tools for reproducible research.

Abstract

Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed…
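The two-step idea in the abstract (run the model once to produce features, then evaluate with a standardized few-shot adaptor) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy encoder, the synthetic data, and the choice of a cosine k-nearest-neighbour adaptor are hypothetical and are not taken from the UNICORN code or paper, which this excerpt does not describe in detail.

```python
import numpy as np

def extract_features(encoder, cases):
    """Step 1 (hypothetical): run the frozen encoder once per case."""
    return np.stack([encoder(case) for case in cases])

def knn_few_shot_predict(shot_feats, shot_labels, query_feats, k=3):
    """Step 2 (hypothetical adaptor): majority vote among the k nearest shots."""
    # Cosine similarity between L2-normalised feature vectors.
    a = shot_feats / np.linalg.norm(shot_feats, axis=1, keepdims=True)
    b = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = b @ a.T                              # shape: (n_query, n_shots)
    nearest = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar shots
    votes = shot_labels[nearest]                # shape: (n_query, k)
    return np.array([np.bincount(v).argmax() for v in votes])

def toy_encoder(x):
    """Stand-in for a foundation model: keeps the first 8 input dimensions."""
    return x[:8]

# Synthetic two-class data: class 0 is centred at +5, class 1 at -5,
# with bounded uniform noise so the toy task is trivially separable.
rng = np.random.default_rng(0)

def make_cases(n_per_class):
    pos = 5.0 + rng.uniform(-1, 1, size=(n_per_class, 16))
    neg = -5.0 + rng.uniform(-1, 1, size=(n_per_class, 16))
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return np.concatenate([pos, neg]), labels

shot_cases, shot_labels = make_cases(4)    # the few labelled examples
query_cases, query_labels = make_cases(4)  # held-out evaluation cases

# Inference is decoupled from evaluation: features are computed first,
# and the adaptor only ever sees the stored feature vectors.
shot_feats = extract_features(toy_encoder, shot_cases)
query_feats = extract_features(toy_encoder, query_cases)

preds = knn_few_shot_predict(shot_feats, shot_labels, query_feats, k=3)
accuracy = float((preds == query_labels).mean())
print(f"few-shot accuracy: {accuracy:.2f}")
```

Because the adaptor is fixed and lightweight, differences in the final score in a setup like this come from the features alone, which is the sense in which such a protocol isolates representation quality.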

Taxonomy

Topics: AI in cancer detection · Artificial Intelligence in Healthcare and Education · Domain Adaptation and Few-Shot Learning