Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling

Bryan Wong; Jong Woo Kim; Huazhu Fu; Mun Yong Yi

arXiv:2505.17982 · cs.CV · December 15, 2025

TL;DR

This paper introduces HiVE-MIL, a hierarchical vision-language framework that models multi-scale tissue structures and aligns visual-textual data to improve few-shot classification of whole slide images in pathology.

Contribution

The paper proposes a graph-based hierarchical model that combines two-stage filtering with a hierarchical contrastive loss to improve multimodal, multi-scale representation for WSI classification.

Findings

01. Outperforms traditional MIL and recent VLM-based MIL methods.

02. Achieves up to a 4.1% improvement in macro F1 score in 16-shot settings.

03. Demonstrates effective modeling of hierarchical tissue structures and multimodal alignment.
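
For readers unfamiliar with the metric cited in the findings, macro F1 is the unweighted mean of per-class F1 scores, so each class counts equally regardless of how many slides it has. The following is a minimal sketch of that standard definition; the function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard macro F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): two classes, four slides.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred)  # class 0 F1 = 2/3, class 1 F1 = 4/5
```

Because the mean is unweighted, a gain on a rare class moves macro F1 as much as the same gain on a common class.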

Abstract

Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same…
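
The two edge types described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the node encoding `(modality, scale, index)`, the function name `build_hierarchical_graph`, the parent mappings, and the choice to connect every visual node to every text node on the same scale are all hypothetical. The filtering and contrastive loss mentioned under Contribution are not modeled here.

```python
from itertools import product


def build_hierarchical_graph(visual_parent, text_parent):
    """Return typed edges (src, dst, edge_type) for a two-scale graph.

    visual_parent / text_parent map each 20x node to its 5x parent.
    Nodes are (modality, scale, index) tuples (an assumed encoding).
    """
    edges = []

    # Parent-child links: coarse (5x) node -> fine (20x) node, per modality.
    for child, parent in list(visual_parent.items()) + list(text_parent.items()):
        edges.append((parent, child, "parent_child"))

    nodes = set(visual_parent) | set(visual_parent.values())
    nodes |= set(text_parent) | set(text_parent.values())

    # Heterogeneous intra-scale links: visual <-> text nodes on the same scale.
    # Fully connecting them is an assumption made for simplicity.
    for scale in ("5x", "20x"):
        visual = sorted(n for n in nodes if n[0] == "visual" and n[1] == scale)
        text = sorted(n for n in nodes if n[0] == "text" and n[1] == scale)
        for v, t in product(visual, text):
            edges.append((v, t, "intra_scale"))
    return edges


# Toy example: one 5x patch with two 20x children, one text node per scale.
visual_parent = {
    ("visual", "20x", 0): ("visual", "5x", 0),
    ("visual", "20x", 1): ("visual", "5x", 0),
}
text_parent = {("text", "20x", 0): ("text", "5x", 0)}
edges = build_hierarchical_graph(visual_parent, text_parent)
```

Keeping the edge type on each tuple lets a downstream graph layer treat hierarchical and cross-modal links differently, which is the point of a heterogeneous graph.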

Taxonomy

Topics: AI in cancer detection · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications