Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations
Yuheng Wang, Yuji Lin, Jiayue Cai, Z. Jane Wang, Tim K. Lee

TL;DR
This paper introduces a transformer-based framework for skin cancer image retrieval that combines global and local vision-language alignment to improve clinical case search.
Contribution
It proposes a hierarchical composed query representation with joint global-local alignment, enhancing retrieval accuracy in medical skin cancer datasets.
Findings
Achieves consistent improvements over state-of-the-art methods on the Derm7pt dataset.
Enables efficient retrieval of relevant medical records for clinical use.
Supports practical deployment in medical settings.
Abstract
Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local…
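The scoring idea in the abstract, a convex combination of a global similarity and a local similarity pooled over several spatial attention masks, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the mean over masks, and the `local_weight` value of 0.6 are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def masked_pool(patches, mask):
    """Pool patch features into one local descriptor using one attention mask.

    patches: list of feature vectors, one per spatial location.
    mask: non-negative attention weights, one per location (hypothetical format).
    """
    total = sum(mask)
    dim = len(patches[0])
    if total == 0:
        return [0.0] * dim
    return [sum(w * p[d] for w, p in zip(mask, patches)) / total for d in range(dim)]


def composed_similarity(q_global, c_global, q_locals, c_locals, local_weight=0.6):
    """Convex combination of local and global similarity.

    q_locals / c_locals: one pooled descriptor per attention mask, for the
    composed query and the candidate image respectively.
    local_weight: assumed value in [0, 1]; values above 0.5 emphasize local cues.
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1] for a convex weighting")
    s_global = cosine(q_global, c_global)
    s_local = sum(cosine(q, c) for q, c in zip(q_locals, c_locals)) / len(q_locals)
    return local_weight * s_local + (1.0 - local_weight) * s_global
```

Candidates in the database would then be ranked by `composed_similarity` against the query. Keeping the weight inside [0, 1] bounds the final score by the two component similarities, which makes the relative influence of local and global evidence easy to read off.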
