# PathVLM-Eval: Evaluation of open vision language models in histopathology

**Authors:** Nauman Ullah Gilal, Rachida Zegour, Khaled Al-Thelaya, Erdener Özer, Marco Agus, Jens Schneider, Sabri Boughorbel

PMC · DOI: 10.1016/j.jpi.2025.100455 · 2025-06-05

## TL;DR

This paper benchmarks more than 60 open vision language models (VLMs) in a zero-shot setting on the PathMMU histopathology dataset, with the aim of guiding the development of VLMs for diagnostic support and pathologist training.

## Contribution

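The paper presents an extensive zero-shot evaluation of more than 60 VLMs on the PathMMU histopathology benchmark, run through the open-source VLMEvalKit framework, and releases the complete results on a public leaderboard.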

## Key findings

- Qwen2-VL-72B-Instruct achieved the highest average score of 63.97% across all PathMMU subsets.
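- The evaluation covers PathMMU subsets such as PubMed, SocialPath, and EduContent, which feature diverse formats, notably multiple-choice questions (MCQs).
- All models are run under a single framework (VLMEvalKit), which the authors describe as ensuring unbiased and contamination-free assessment.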
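
As a rough illustration of how an average score across subsets can be derived from multiple-choice results, the sketch below computes per-subset accuracy and then an unweighted mean over subsets. It is not the authors' code and does not use VLMEvalKit. The function names, the record layout, and the toy records are all hypothetical, and treating the reported average as an unweighted mean over subsets is an assumption, since the summary does not say how the 63.97% figure is aggregated.

```python
from collections import defaultdict


def subset_accuracies(records):
    """Return accuracy (in percent) per subset.

    `records` is an iterable of (subset, predicted_option, correct_option)
    tuples. This layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, predicted, gold in records:
        total[subset] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[subset] += 1
    return {s: 100.0 * correct[s] / total[s] for s in total}


def macro_average(per_subset):
    """Unweighted mean over subsets (an assumed aggregation rule)."""
    return sum(per_subset.values()) / len(per_subset)


# Made-up toy records for illustration only; not data from the paper.
toy_records = [
    ("PubMed", "A", "A"),
    ("PubMed", "B", "C"),
    ("SocialPath", "d", "D"),
    ("SocialPath", "A", "A"),
    ("EduContent", "B", "A"),
    ("EduContent", "C", "C"),
    ("EduContent", "C", "C"),
    ("EduContent", "D", "D"),
]

per_subset = subset_accuracies(toy_records)
average = macro_average(per_subset)
print(per_subset, average)
```

An unweighted mean gives every subset equal influence regardless of size. A question-weighted mean would instead favor the larger subsets, so the two can differ when subset sizes are uneven.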

## Abstract

The emerging trend of vision language models (VLMs) has introduced a new paradigm in artificial intelligence (AI). However, their evaluation has predominantly focused on general-purpose datasets, providing a limited understanding of their effectiveness in specialized domains. Medical imaging, particularly digital pathology, could significantly benefit from VLMs for histological interpretation and diagnosis, enabling pathologists to use a complementary tool for faster, more comprehensive reporting and efficient healthcare service. In this work, we are interested in benchmarking VLMs on histopathology image understanding. We present an extensive evaluation of recent VLMs on the PathMMU dataset, a domain-specific benchmark that includes subsets such as PubMed, SocialPath, and EduContent. These datasets feature diverse formats, notably multiple-choice questions (MCQs), designed to aid pathologists in diagnostic reasoning and support professional development initiatives in histopathology. Utilizing VLMEvalKit, a widely used open-source evaluation framework, we bring publicly available pathology datasets under a single evaluation umbrella, ensuring unbiased and contamination-free assessments of model performance. Our study conducts extensive zero-shot evaluations of more than 60 state-of-the-art VLMs, including LLaVA, Qwen-VL, Qwen2-VL, InternVL, Phi3, Llama3, MOLMO, and XComposer series, significantly expanding the range of evaluated models compared to prior literature. Among the tested models, Qwen2-VL-72B-Instruct achieved superior performance with an average score of 63.97%, outperforming other models across all PathMMU subsets. We conclude that this extensive evaluation will serve as a valuable resource, fostering the development of next-generation VLMs for analyzing digital pathology images. Additionally, we have released the complete evaluation results on our leaderboard PathVLM-Eval: https://huggingface.co/spaces/gilalnauman/PathVLMs.

## Full-text entities

- **Diseases:** cancerous (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12276438/full.md

---
Source: https://tomesphere.com/paper/PMC12276438