Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation

Jihun Kim; Hoyong Kwon; Hyeokjun Kweon; Kuk-Jin Yoon

arXiv:2604.10950 · cs.CV · April 15, 2026

TL;DR

This paper introduces DiTTA, a framework that transforms image segmentation models into temporally-aware video segmentation models through efficient test-time adaptation and knowledge distillation, without requiring annotated videos.

Contribution

DiTTA distills temporal segmentation knowledge from a foundation model (SAM2) into an Image Semantic Segmentation (ISS) model during a brief, single-pass initialization phase, enabling effective video segmentation with limited video data and no video annotations.

Findings

01. DiTTA outperforms zero-shot methods that invoke SAM2 during inference.

02. DiTTA achieves competitive results on the VSPW and Cityscapes datasets.

03. DiTTA remains effective when only 10% of the video data is used for adaptation.

Abstract

Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame…
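
The adapt-then-infer schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`distill_step`, `adapt_then_infer`, `teacher`) is hypothetical, the student is a linear per-pixel classifier standing in for the ISS model, and `teacher` is any callable that returns per-pixel class labels, standing in for pseudo-labels derived from SAM2. Adapting on the first frames of the video is an assumption based on the phrase "initialization phase", and the temporal fusion module is omitted entirely.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def student_logits(weights, feature):
    # Toy stand-in for the ISS model: a linear per-pixel classifier
    # with one weight vector per class (an assumption for illustration).
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def distill_step(weights, frame, teacher_labels, lr=0.5):
    """Take one gradient step on cross-entropy against teacher labels.

    Updates `weights` in place and returns the mean loss before the update.
    """
    grads = [[0.0] * len(row) for row in weights]
    loss = 0.0
    for feature, label in zip(frame, teacher_labels):
        probs = softmax(student_logits(weights, feature))
        loss -= math.log(probs[label] + 1e-12)
        for c, p in enumerate(probs):
            err = p - (1.0 if c == label else 0.0)
            for j, x in enumerate(feature):
                grads[c][j] += err * x
    n = len(frame)
    for c, row in enumerate(weights):
        for j in range(len(row)):
            row[j] -= lr * grads[c][j] / n
    return loss / n

def predict(weights, frame):
    # Arg-max class per pixel, using the student only.
    out = []
    for feature in frame:
        logits = student_logits(weights, feature)
        out.append(logits.index(max(logits)))
    return out

def adapt_then_infer(weights, frames, teacher, adapt_fraction=0.1):
    """Single pass: distill on the first frames, then predict the rest.

    The teacher is called only during the adaptation prefix (assumed to be
    the first `adapt_fraction` of the video); later frames never touch it.
    """
    n_adapt = max(1, int(len(frames) * adapt_fraction))
    losses = [distill_step(weights, f, teacher(f)) for f in frames[:n_adapt]]
    preds = [predict(weights, f) for f in frames[n_adapt:]]
    return losses, preds
```

Calling `adapt_then_infer(weights, frames, teacher)` with `frames` as a list of frames (each a list of per-pixel feature vectors) returns the per-frame distillation losses and the student's predictions for the remaining frames. The point of the design is that the expensive teacher runs only on the adaptation prefix, so inference cost afterwards is that of the student alone.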
