PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency
Minbing Chen, Zhu Meng, Fei Su

TL;DR
PathGLS is a multi-dimensional, reference-free evaluation framework for pathology vision-language models. It detects subtle failures such as hallucinations and correlates well with clinical error hierarchies, with the aim of enabling safer deployment.
Contribution
The paper presents PathGLS, a novel multi-dimensional evaluation method for pathology VLMs that does not require ground truth and assesses grounding, logic, and stability.
Findings
PathGLS shows a 40.2% drop in sensitivity score on hallucinated outputs, outperforming BERTScore.
It achieves a Spearman's rank correlation of 0.71 with clinical error hierarchies.
PathGLS outperforms LLM-based approaches in robustness and reliability.
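The Spearman's rank correlation cited in the findings measures how closely two rankings agree. The following is a minimal sketch of that statistic in plain Python, not the authors' evaluation code. The severity levels and score drops in the usage example are hypothetical values chosen for illustration and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical data: clinical severity level of an injected error,
    # and the drop in a metric's score that the error causes.
    severity = [1, 2, 3, 4, 5]
    score_drop = [0.05, 0.10, 0.08, 0.30, 0.40]
    print(spearman(severity, score_drop))
```

A value near 1 means the metric penalizes errors in roughly the same order that clinicians rank their severity; a value near 0 means the metric's penalties carry little information about severity.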
Abstract
Vision-Language Models (VLMs) offer significant potential in computational pathology by enabling interpretable image analysis, automated reporting, and scalable decision support. However, their widespread clinical adoption remains limited due to the absence of reliable, automated evaluation metrics capable of identifying subtle failures such as hallucinations. To address this gap, we propose PathGLS, a novel reference-free evaluation framework that assesses pathology VLMs across three dimensions: Grounding (fine-grained visual-text alignment), Logic (entailment graph consistency using Natural Language Inference), and Stability (output variance under adversarial visual-semantic perturbations). PathGLS supports both patch-level and whole-slide image (WSI)-level analysis, yielding a comprehensive trust score. Experiments on Quilt-1M, TCGA, REG2025, PathMMU and TCGA-Sarcoma datasets…
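To make the idea of a three-dimensional trust score concrete, here is a hedged sketch of how per-dimension scores might be combined. The abstract does not state the aggregation rule or how each dimension is computed, so everything below is an assumption: the names `DimensionScores`, `stability_score`, and `trust_score` are hypothetical, the weighted mean is a placeholder for whatever PathGLS actually uses, and the token-overlap stability measure is a toy stand-in for output variance under perturbation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical container; each score is assumed to lie in [0, 1]."""
    grounding: float
    logic: float
    stability: float


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts (1.0 means identical token sets)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def stability_score(outputs: list[str]) -> float:
    """Toy stability proxy: mean pairwise overlap of outputs generated
    for perturbed versions of the same input. Not the paper's measure."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output cannot disagree with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def trust_score(scores: DimensionScores, weights=(1.0, 1.0, 1.0)) -> float:
    """Assumed aggregation: weighted mean of the three dimension scores."""
    wg, wl, ws = weights
    total = wg + wl + ws
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (wg * scores.grounding + wl * scores.logic + ws * scores.stability) / total


if __name__ == "__main__":
    # Hypothetical model outputs for three perturbed copies of one patch.
    outputs = [
        "invasive ductal carcinoma with necrosis",
        "invasive ductal carcinoma with focal necrosis",
        "benign fibroadenoma",
    ]
    scores = DimensionScores(grounding=0.8, logic=0.7,
                             stability=stability_score(outputs))
    print(trust_score(scores))
```

A weighted mean keeps each dimension's contribution easy to inspect, but it also lets a high score in one dimension mask a low score in another; a minimum or product would be a stricter alternative under the same assumptions.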
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Adversarial Robustness in Machine Learning
