SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis

Zhixiang Lu; Shijie Xu; Kaicheng Yan; Xuyue Cai; Chong Zhang; Yulong Li; Angelos Stefanidis; Anh Nguyen; Jionglong Su

arXiv:2603.21010 · cs.CV · March 24, 2026

Open Access

TL;DR

SkinCLIP-VL is a resource-efficient vision-language framework for skin cancer diagnosis. It pairs a frozen CLIP encoder with a quantized, LoRA-adapted Qwen2.5-VL and a consistency-aware alignment loss, and it outperforms 13B-parameter baselines in accuracy while using 43% fewer parameters.

Contribution

The paper introduces SkinCLIP-VL, a resource-efficient vision-language framework for trustworthy skin cancer diagnosis, together with the Consistency-aware Focal Alignment (CFA) Loss used to train it.

Findings

01. Surpasses 13B-parameter baselines by 4.3-6.2% in accuracy on the ISIC and Derm7pt benchmarks.
02. Uses 43% fewer parameters than those baselines.
03. Enhances clinical trust through visually grounded rationales.

Abstract

The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert…
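
The abstract names three ingredients of the CFA Loss (focal re-weighting, semantic alignment, and calibration) but does not give the formula. The sketch below is one plausible way such terms could be combined for a single sample, not the authors' formulation. The weighted-sum form, the cosine-distance alignment term, the Brier-score calibration term, and all names (`cfa_loss_sketch`, `gamma`, `lam_align`, `lam_cal`) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def cfa_loss_sketch(logits, label, img_emb, txt_emb,
                    gamma=2.0, lam_align=1.0, lam_cal=1.0):
    """Hypothetical per-sample loss with three terms (assumed form).

    focal: cross-entropy scaled by (1 - p_true)^gamma, so well-classified
           samples contribute less (useful under long-tailed class counts).
    align: cosine distance between an image embedding and the text
           embedding of its label description (assumed alignment term).
    calib: Brier score of the predicted distribution (assumed stand-in
           for whatever calibration term the paper actually uses).
    """
    probs = softmax(logits)
    p_true = probs[label]

    focal = -((1.0 - p_true) ** gamma) * math.log(p_true + 1e-12)
    align = 1.0 - cosine(img_emb, txt_emb)
    calib = sum((p - (1.0 if k == label else 0.0)) ** 2
                for k, p in enumerate(probs))

    return focal + lam_align * align + lam_cal * calib


# A confident, correct, well-aligned sample versus a misclassified,
# misaligned one; the second should be penalized more heavily.
easy = cfa_loss_sketch([4.0, 0.0, 0.0], 0, [1.0, 0.0], [1.0, 0.0])
hard = cfa_loss_sketch([0.0, 4.0, 0.0], 0, [1.0, 0.0], [0.0, 1.0])
print(easy < hard)
```

With `gamma=0` and both weights set to zero, the sketch reduces to plain cross-entropy, which makes the effect of each added term easy to isolate.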

Taxonomy

Topics: Cutaneous Melanoma Detection and Management · Multimodal Machine Learning Applications · AI in cancer detection