Skin Lesion Phenotyping via Nested Multi-modal Contrastive Learning

Dionysis Christopoulos; Sotiris Spanos; Eirini Baltzi; Valsamis Ntouskos; Konstantinos Karantzalos

arXiv:2505.23709·cs.CV·January 23, 2026

Open Access · 3 Reviews

TL;DR

This paper presents SLIMP, a novel nested contrastive learning method that integrates image and metadata to improve skin lesion classification, addressing challenges of variability and lack of context in medical imaging.

Contribution

SLIMP introduces a new multi-modal pre-training approach that combines images and metadata for enhanced skin lesion representation learning.

Findings

01 Improved classification performance over existing methods.

02 Effective integration of image and metadata modalities.

03 Enhanced representation quality for skin lesion analysis.

Abstract

We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and the lack of clinical and phenotypical context. Clinicians typically follow a holistic approach when assessing a patient's risk level and deciding which lesions may be malignant and need to be excised, considering the patient's medical history as well as the appearance of the patient's other lesions. Inspired by this, SLIMP combines the appearance and the metadata of individual skin lesions with patient-level metadata relating to their medical record and…
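To make the image-metadata contrastive idea concrete, here is a minimal sketch of a generic symmetric InfoNCE objective over paired image and metadata embeddings. It is an illustration under stated assumptions, not the authors' implementation: it covers only a single, non-nested level, and the function names, the temperature of 0.1, and the toy 2-D embeddings are all hypothetical. It uses only the Python standard library.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax (subtracts the max before exponentiating)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def info_nce(anchors: Sequence[Vector], positives: Sequence[Vector],
             temperature: float = 0.1) -> float:
    """One-directional InfoNCE: anchors[i] should match positives[i];
    every other positives[j] in the batch acts as a negative."""
    if len(anchors) != len(positives) or len(anchors) == 0:
        raise ValueError("need equally sized, non-empty batches")
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, a_i in enumerate(a):
        logits = [dot(a_i, p_j) / temperature for p_j in p]
        loss -= log_softmax(logits)[i]
    return loss / len(a)


def symmetric_info_nce(image_emb: Sequence[Vector], meta_emb: Sequence[Vector],
                       temperature: float = 0.1) -> float:
    """Average of the image-to-metadata and metadata-to-image losses."""
    return 0.5 * (info_nce(image_emb, meta_emb, temperature)
                  + info_nce(meta_emb, image_emb, temperature))


# Toy batch of two lesions (hypothetical embeddings, assumed for illustration).
image_emb = [[1.0, 0.0], [0.0, 1.0]]
meta_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each metadata vector near its image
meta_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = symmetric_info_nce(image_emb, meta_aligned)
loss_swapped = symmetric_info_nce(image_emb, meta_swapped)
```

In this toy batch, `loss_aligned` comes out far smaller than `loss_swapped`, which is the behavior a contrastive objective is meant to reward. How SLIMP nests lesion-level and patient-level terms is not specified in the text available here, so that part is left out of the sketch.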

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper proposes a well-motivated approach to contrastive pre-training on medical image-metadata pairs and offers practical solutions, such as continual pre-training and metadata extrapolation, for adapting pre-trained models to other datasets, even when metadata is not available.

Weaknesses

The SLICE-3D dataset has 393 malignant samples and around 400k benign samples. On the patient level, there are patients for whom there is no malignant sample, and even for patients with malignant samples, the number of malignant samples can be very small compared to the number of benign samples. Unlike supervised training that would have a detector specifically focus on the feature representing malignant lesions, pre-training on this dataset with only optimizing the distance between the image an

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The performance of the proposed model is better than existing methods due to the complementary information from metadata. 2. Sufficient ablation studies prove the effectiveness of the proposed modules.

Weaknesses

1. The proposed method requires an image, disease metadata, and patient metadata. This many input requirements will limit how well the model generalizes across settings. 2. The proposed method introduces a contrastive loss between image and tabular data. How does it differ from the standard contrastive loss? 3. Some skin disease datasets differ from the dataset used here. How does the model ensure transferability to these datasets? 4. The evaluation

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The paper addresses an important problem in medical imaging—the effective use of multi-modal data for diagnosis. Improving skin lesion classification can have a direct impact on early melanoma detection. 2. The overall approach is presented clearly, and the motivation is well-explained. Figure 1 provides a good visual summary of the SLIMP architecture. 3. The paper considers the practical challenges of working with multiple medical datasets, such as diverging metadata schemas, and proposes

Weaknesses

1. The novelty of SLIMP appears to be rather limited. In Figure 1, the most crucial contribution seems to be the InfoNCE loss, which was proposed in 2018. Additionally, the feature concatenation approach is quite common. 2. The authors claim several contributions (nested loss, continual pre-training, metadata extrapolation). However, there are no ablation studies to quantify the individual impact of each component. For example, how does the nested loss compare to a "flat" contrastive loss that

Taxonomy

Topics: Cutaneous Melanoma Detection and Management · AI in cancer detection · Face recognition and analysis

Methods: Contrastive Learning