Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations

Yuheng Wang; Yuji Lin; Jiayue Cai; Z. Jane Wang; Tim K. Lee

arXiv:2603.09108 · cs.CV · April 21, 2026

TL;DR

This paper introduces a transformer-based framework for skin cancer image retrieval that combines global and local vision-language alignment to improve clinical case search.

Contribution

It proposes a hierarchical composed query representation with joint global-local alignment, enhancing retrieval accuracy in medical skin cancer datasets.

Findings

1. Achieves consistent improvements over state-of-the-art methods on the Derm7pt dataset.

2. Enables efficient retrieval of relevant medical records for clinical use.

3. Supports practical deployment in medical settings.

Abstract

Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision-making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local…
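
The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`masked_pool`, `joint_similarity`), the softmax form of the spatial masks, the mean over regions, cosine similarity, and the weight `alpha` are all assumptions made for illustration. The abstract states only that local alignment uses multiple spatial attention masks and that the final similarity is a convex weighting of the two terms.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def masked_pool(patches, mask_logits):
    """Pool N patch features into K region descriptors.

    patches:     (N, D) patch embeddings from an image encoder (assumed).
    mask_logits: (K, N) scores for K spatial attention masks (assumed form).
    Returns:     (K, D) one attention-weighted average per mask.
    """
    # Softmax over patches, computed stably, so each mask sums to 1.
    shifted = mask_logits - mask_logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ patches


def joint_similarity(q_global, q_local, c_global, c_local, alpha=0.3):
    """Convex combination of global and local cosine similarity.

    q_global, c_global: (D,) holistic query / candidate embeddings.
    q_local, c_local:   (K, D) region descriptors, matched mask by mask.
    alpha:              hypothetical weight on the global term, in [0, 1];
                        a small alpha emphasizes the local term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex weighting")
    global_sim = float(l2_normalize(q_global) @ l2_normalize(c_global))
    # Cosine similarity per region, averaged over the K regions.
    per_region = np.sum(l2_normalize(q_local) * l2_normalize(c_local), axis=1)
    local_sim = float(per_region.mean())
    return alpha * global_sim + (1.0 - alpha) * local_sim
```

Because the weights are non-negative and sum to one, the combined score always lies between the global and local similarities, so neither term can push the result outside the cosine range of [-1, 1].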
