Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Kuk-Jin Yoon

TL;DR
This paper introduces DiTTA, a framework that transforms image segmentation models into temporally-aware video segmentation models through efficient test-time adaptation and knowledge distillation, without requiring annotated videos.
Contribution
DiTTA distills temporal segmentation knowledge from a foundation model (SAM2) into an image semantic segmentation (ISS) model during a brief, single-pass initialization phase, enabling effective video segmentation with limited video data and no annotated videos.
Findings
DiTTA outperforms zero-shot methods that use SAM2 during inference.
It achieves competitive results on the VSPW and Cityscapes datasets.
It remains effective when only 10% of the video data is used for adaptation.
Abstract
Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame…
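The distillation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' loss: it assumes a generic pixel-wise cross-entropy between the ISS model's per-frame logits and pseudo-labels derived from a teacher's propagated masks. The function names, the `ignore_index` convention, and the assumption that teacher masks have already been mapped to semantic class ids are all hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def distillation_loss(student_logits, teacher_labels, ignore_index=-1):
    """Mean pixel-wise cross-entropy against teacher pseudo-labels.

    student_logits: float array of shape (T, H, W, C), one score per class.
    teacher_labels: int array of shape (T, H, W); pixels the teacher does
        not cover are marked with ignore_index (hypothetical convention).
    """
    probs = softmax(student_logits, axis=-1)
    valid = teacher_labels != ignore_index
    # Replace ignored labels with a dummy class so indexing stays in range;
    # those pixels are masked out before averaging.
    safe_labels = np.where(valid, teacher_labels, 0)
    picked = np.take_along_axis(probs, safe_labels[..., None], axis=-1)[..., 0]
    nll = -np.log(picked + 1e-12)
    if not valid.any():
        return 0.0
    return float(nll[valid].mean())


# Toy clip: 2 frames of 4x4 pixels, 3 classes, uniform student predictions.
logits = np.zeros((2, 4, 4, 3))
labels = np.zeros((2, 4, 4), dtype=np.int64)
loss = distillation_loss(logits, labels)
```

In a test-time adaptation loop, a loss of this kind would be minimized over the incoming video frames, updating the student without any ground-truth annotation. Which parameters are updated and how the temporal fusion module enters the objective are not specified in the text above.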
