# Preclinical HistoBench: A Pilot Benchmark Dataset for Evaluating Large Language Models on Preclinical Histopathological Classification

**Authors:** Avan Kader, Marie-Luise H. H. Ranner-Hafferl, Felix Reuter, Miriam L. Fichtner, Marcus R. Makowski, Keno K. Bressem, Lisa C. Adams

PMC · DOI: 10.3390/biology15050395 · 2026-02-27

## TL;DR

This paper introduces a pilot benchmark dataset of 378 preclinical histology samples for testing how well large language models classify species, organ, staining method, and preparation type. Performance varies substantially across models and tasks, and none of the models is yet reliable for standalone use.

## Contribution

The paper introduces the first pilot benchmark dataset for evaluating large language models on multi-dimensional histopathological classification tasks.

## Key findings

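- GPT-4.1 had the best mouse identification (70.4% sensitivity) but failed to recognize minority species.
- Llama 3.2 was the only model to identify all three species (rabbit: 75% sensitivity, rat: 85.7% sensitivity) but performed poorly on mouse recognition (0.3% sensitivity).
- For staining classification, Llama 3.2 achieved >88% sensitivity for most staining types. Preparation type classification was particularly challenging: only GPT-4.1 achieved balanced recognition of both frozen and paraffin-embedded samples.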

## Abstract

This study evaluates the capability of large language models to perform multi-dimensional classification of preclinical histological samples, addressing the absence of standardized benchmarks in this domain. We assessed three language models (GPT-4.1, GPT-4o-mini, and Llama 3.2) using 378 histological samples across four classification dimensions: species identification (mouse, rabbit, rat), organ recognition (kidney, liver, prostate, spleen), staining method classification (including H&E and specialized stains), and preparation technique determination (frozen versus paraffin-embedded). Our findings reveal substantial variability in model performance across tasks, with pronounced sensitivity to class imbalance. GPT-4.1 demonstrated superior performance for mouse identification (70.4% sensitivity) but failed to recognize minority species, while Llama 3.2 uniquely identified all three species despite poor mouse recognition. For staining classification, Llama 3.2 achieved the highest overall performance with greater than 88% sensitivity for most staining types. Preparation type classification proved particularly challenging, with only GPT-4.1 achieving balanced recognition of both frozen and paraffin-embedded samples. These results indicate that current large language models lack the reliability required for standalone diagnostic applications in histopathology. However, they may serve as valuable preliminary screening tools in research environments when combined with expert validation, potentially accelerating workflow efficiency while maintaining diagnostic accuracy through human oversight.

**Background and Purpose:** We present a pilot benchmark dataset of 378 preclinical histological samples for evaluating large language model (LLM) performance on multi-dimensional classification tasks. This dataset addresses the lack of standardized benchmarks for assessing LLMs in preclinical histopathology, encompassing species identification (mouse, rabbit, rat), organ recognition, staining methods, and preparation techniques.

**Methods:** We evaluated the LLMs GPT-4.1, GPT-4o-mini, and Llama 3.2 on 378 histological samples across four classification dimensions: species identification (mouse, rabbit, rat), organ recognition (kidney, liver, prostate, spleen), staining method classification (H&E, Elastica van Gieson, collagen, iron, IHC-elastin, MOVAT’s pentachrome), and preparation type determination (frozen vs. paraffin-embedded). Performance was assessed using sensitivity and specificity metrics with confusion matrix analysis.

**Results:** Model performance varied substantially across tasks and exhibited strong sensitivity to class imbalance. For preparation type classification, GPT-4.1 achieved the most balanced performance (50% frozen sensitivity, 85.7% paraffin sensitivity), while Llama 3.2 failed to recognize paraffin samples (0% sensitivity). In species classification, Llama 3.2 was the only model capable of identifying all three species (rabbit: 75% sensitivity, rat: 85.7% sensitivity) despite poor mouse recognition (0.3% sensitivity). GPT-4.1 achieved higher mouse sensitivity within this dataset (70.4%) but failed with minority species. For staining classification, Llama 3.2 demonstrated the highest overall performance, achieving >88% sensitivity for most staining types, while GPT-4o-mini showed perfect H&E recognition (100% sensitivity).

**Conclusions:** Current LLMs demonstrate variable performance for histological classification with substantial sensitivity to class imbalance. While not suitable for standalone diagnostic use, they may serve as useful screening tools in research settings with appropriate human oversight.
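
The abstract reports per-class sensitivity and specificity derived from confusion matrices. The following is a minimal sketch of how such one-vs-rest metrics can be computed. It is not the authors' evaluation code: the function name `per_class_sensitivity_specificity` is hypothetical, and the toy labels are made up for illustration and are not data from the paper.

```python
from collections import Counter


def per_class_sensitivity_specificity(y_true, y_pred, classes):
    """Return {class: (sensitivity, specificity)} using one-vs-rest counts."""
    # Count each (true label, predicted label) pair: a sparse confusion matrix.
    pairs = Counter(zip(y_true, y_pred))
    total = len(y_true)
    results = {}
    for cls in classes:
        tp = pairs[(cls, cls)]
        fn = sum(n for (t, p), n in pairs.items() if t == cls and p != cls)
        fp = sum(n for (t, p), n in pairs.items() if t != cls and p == cls)
        tn = total - tp - fn - fp
        # Undefined ratios (no positives or no negatives) are reported as NaN.
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        results[cls] = (sensitivity, specificity)
    return results


# Illustrative toy labels only (NOT data from the paper): an imbalanced set
# and a hypothetical model that always predicts the majority class.
y_true = ["mouse"] * 8 + ["rabbit"] + ["rat"]
y_pred = ["mouse"] * 10
classes = ["mouse", "rabbit", "rat"]

metrics = per_class_sensitivity_specificity(y_true, y_pred, classes)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for cls in classes:
    sens, spec = metrics[cls]
    print(f"{cls}: sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"overall accuracy={accuracy:.2f}")
```

In this toy case the overall accuracy is 0.8 even though rabbit and rat sensitivity are both 0, which illustrates why per-class metrics are more informative than a single accuracy figure when classes are imbalanced.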

## Linked entities

- **Species:** Mus musculus (taxon 10090), Rattus norvegicus (taxon 10116)

## Full-text entities

- **Chemicals:** H&E (MESH:D006371), MOVAT (-), paraffin (MESH:D010232), iron (MESH:D007501)
- **Species:** Rattus norvegicus (brown rat, species) [taxon 10116], Homo sapiens (human, species) [taxon 9606], Oryctolagus cuniculus (domestic rabbit, species) [taxon 9986], Mus musculus (house mouse, species) [taxon 10090]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12984232/full.md

---
Source: https://tomesphere.com/paper/PMC12984232