From Classification to Cross-Modal Understanding: Leveraging Vision-Language Models for Fine-Grained Renal Pathology

Zhenhao Guo; Rachit Saluja; Tianyuan Yao; Quan Liu; Junchao Zhu; Haibo Wang; Daniel Reisenbüchler; Yuankai Huo; Benjamin Liechty; David J. Pisapia; Kenji Ikemura; Steven Salvatore; Surya Seshan; Mert R. Sabuncu; Yihe Yang; Ruining Deng

arXiv:2511.11984 · cs.CV · November 18, 2025

Open Access

TL;DR

This paper evaluates vision-language models for fine-grained renal pathology classification in a few-shot setting, providing insights on model adaptation, supervision, and representation geometry under clinical data constraints.

Contribution

It systematically assesses pathology-specialized and general-purpose vision-language models for renal subtyping with limited data, offering practical guidance for clinical model deployment.

Findings

01

Pathology-specialized models with vanilla fine-tuning perform best.

02

Even with only 4-8 labeled examples, models show improved discrimination and calibration.

03

Discrimination between positive and negative examples is as crucial as image-text alignment.

Abstract

Fine-grained glomerular subtyping is central to kidney biopsy interpretation, but clinically valuable labels are scarce and difficult to obtain. Existing computational pathology approaches instead tend to evaluate coarse disease classification under full supervision with image-only models, so it remains unclear how vision-language models (VLMs) should be adapted for clinically meaningful subtyping under data constraints. In this work, we model fine-grained glomerular subtyping as a clinically realistic few-shot problem and systematically evaluate both pathology-specialized and general-purpose vision-language models in this setting. We assess not only classification performance (accuracy, AUC, F1) but also the geometry of the learned representations, examining feature alignment between image and text embeddings and the separability of glomerular subtypes. By jointly analyzing shot…
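
To make the evaluation described in the abstract more concrete, the sketch below scores a toy set of image embeddings against one text embedding per class. It is an illustration only, not the authors' pipeline: the function names (`cosine`, `evaluate`), the toy vectors, and the particular "alignment" and "margin" definitions are all assumptions chosen for clarity, and AUC is left out to keep the example short. Alignment is taken here as the mean cosine similarity between each image and the text embedding of its true class; margin is the mean gap between that similarity and the best-scoring wrong class, a simple stand-in for positive-versus-negative discrimination.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def evaluate(image_embs, text_embs, labels):
    """Score image embeddings against one text embedding per class.

    Hypothetical helper: assumes at least two classes and that
    labels[i] is the index of the true class of image i.
    """
    n = len(labels)
    # Similarity of every image to every class prompt.
    sims = [[cosine(img, txt) for txt in text_embs] for img in image_embs]
    # Predict the class whose text embedding is most similar.
    preds = [max(range(len(row)), key=row.__getitem__) for row in sims]

    accuracy = sum(p == y for p, y in zip(preds, labels)) / n

    # Macro F1: per-class F1 averaged with equal weight per class.
    f1_scores = []
    for c in range(len(text_embs)):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    # Alignment: similarity to the matching (positive) class text.
    matched = [row[y] for row, y in zip(sims, labels)]
    # Margin: positive similarity minus the strongest negative class.
    best_other = [
        max(s for j, s in enumerate(row) if j != y)
        for row, y in zip(sims, labels)
    ]
    alignment = sum(matched) / n
    margin = sum(m - o for m, o in zip(matched, best_other)) / n

    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "alignment": alignment,
        "margin": margin,
    }


# Toy data (made up): three classes, four images.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 1.0],
    [1.0, 0.0, 0.2],
]
labels = [0, 1, 2, 0]

result = evaluate(image_embs, text_embs, labels)
print(result)
```

Reporting alignment and margin separately is a deliberate choice in this sketch: a model can place images close to their matching text while still leaving the wrong classes nearly as close, and only the margin exposes that.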

Taxonomy

Topics: AI in cancer detection · Machine Learning in Healthcare · Retinal Imaging and Analysis