CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

Pardis Taghavi; Tian Liu; Renjie Li; Reza Langari; Zhengzhong Tu

arXiv:2505.21904·cs.CV·October 10, 2025

TL;DR

CAST introduces a semi-supervised distillation framework that effectively compresses large vision models into smaller, accurate instance segmentation models by leveraging contrastive learning and unlabeled data.

Contribution

The paper proposes a novel semi-supervised knowledge distillation method with an instance-aware contrastive loss for improved instance segmentation.

Findings

01

Smaller student models outperform zero-shot teachers by +8.5 and +7.1 AP.

02

CAST surpasses adapted teachers by +3.4 and +1.5 AP.

03

Outperforms state-of-the-art semi-supervised distillation methods on Cityscapes and ADE20K.

Abstract

Instance segmentation demands costly per-pixel annotations and computationally expensive models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFM) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes…
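The abstract describes the instance-aware pixel-wise contrastive loss only at a high level, so the following is a minimal sketch of the general idea and not the paper's formulation. It assumes an InfoNCE-style objective over sampled pixel embeddings, where pixels of the same instance are positives and pixels of other instances are negatives weighted by a fused confidence. The product fusion `mask_score * class_score`, the temperature `tau`, and the function name `instance_contrastive_loss` are all hypothetical choices made for illustration.

```python
import numpy as np

def instance_contrastive_loss(emb, inst_ids, mask_score, class_score, tau=0.1):
    """Illustrative pixel-wise InfoNCE-style loss with score-weighted negatives.

    emb:         (N, D) pixel embeddings (assumed non-zero)
    inst_ids:    (N,)   instance id of each pixel
    mask_score:  (N,)   per-pixel mask confidence in [0, 1]
    class_score: (N,)   per-pixel class confidence in [0, 1]
    """
    emb = np.asarray(emb, dtype=float)
    inst_ids = np.asarray(inst_ids)
    # Assumption: fuse mask and class confidence with a simple product.
    fused = np.asarray(mask_score, dtype=float) * np.asarray(class_score, dtype=float)

    # Cosine similarity scaled by the temperature.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / tau
    # Subtracting the row max keeps exp() stable and cancels in the ratio below.
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))

    same = inst_ids[:, None] == inst_ids[None, :]
    pos = same & ~np.eye(len(emb), dtype=bool)  # same instance, excluding self
    neg = ~same                                 # pixels of other instances

    # Negatives with a higher fused score contribute more to the denominator.
    neg_term = (exp_sim * neg * fused[None, :]).sum(axis=1)

    per_anchor = []
    for i in range(len(emb)):
        if pos[i].any():  # skip anchors that have no positive pixel
            p = exp_sim[i, pos[i]]
            per_anchor.append(-np.log(p / (p + neg_term[i])).mean())
    return float(np.mean(per_anchor)) if per_anchor else 0.0

# Toy usage: two instances with two pixels each.
ids = [0, 0, 1, 1]
ones = np.ones(4)
separated = [[1, 0], [1, 0], [0, 1], [0, 1]]  # instances occupy distinct directions
mixed = [[1, 0], [0, 1], [1, 0], [0, 1]]      # instances overlap in embedding space
loss_separated = instance_contrastive_loss(separated, ids, ones, ones)
loss_mixed = instance_contrastive_loss(mixed, ids, ones, ones)
```

In this toy setup the loss is lower when the embeddings of different instances are well separated, which is the "clear inter-instance margin" behavior the abstract refers to. How the actual method samples negatives and combines the scores may differ.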

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper is clearly organized and easy to follow. 2. The experiments are solid, and the ablation studies are comprehensive.

Weaknesses

My main concern is the insufficient discussion of related work. (1) The core design of the method is the instance-aware pixel-wise contrastive loss, yet many contrastive-learning-based knowledge distillation methods already exist, and the paper does not discuss them. (2) Early works contrasted teachers’ and students’ features using different views generated from various samples’ features [1, 2] as well as gradients [3]

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Comprehensive Pipeline: CAST offers a well-structured SSKD pipeline that unifies teacher adaptation, knowledge distillation with a pixel-wise contrastive component, and student fine-tuning, systematically addressing the challenges of compressing large VFMs for instance segmentation. 2. Technical Rigor and Theoretical Insight: The paper details a mathematically sound instance-aware pixel-level contrastive loss, complete with negative sampling mechanisms incorporating fused mask-class cues. 3.

Weaknesses

1. While the proposal is thoughtfully implemented and the contrastive calibration is well-integrated, the primary technical innovations seem incremental. The instance-aware contrastive loss largely adapts previous contrastive/self-supervised learning techniques with a new but straightforward mask and class score fusion for negative sampling. 2. The claims regarding “robustness” under low-label regimes stem from results on only two standard datasets. Broader generalization or domain robustness (e

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper is clearly written and easy to follow. 2. The motivation of using unlabeled data to improve the labor-intensive instance segmentation scenario is good.

Weaknesses

1. The core idea is not new. There are many existing works [1, 2, 3] that explore dense contrastive learning for dense perception tasks, especially from three or four years ago, when contrastive learning was still very popular. 2. The gain from introducing unlabeled images is marginal. For example, on ADE20K, the supervised-only setting already achieves 23.5 mAP. Using 9x more unlabeled data only improves it by 1.0 mAP. 3. The proposed framework does not exhibit a clear empirical advantage over exist

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Knowledge Distillation · ALIGN