Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?
Samik Some, Vinay P. Namboodiri

TL;DR
This paper explores reducing annotation costs in video semantic segmentation by combining unannotated frames and coarse annotations with segmentation foundation models, achieving comparable performance with fewer manual labels.
Contribution
The paper demonstrates that foundation models such as SAM and SAM 2, applied to unannotated frames and coarse annotations, can cut annotation effort by a third without sacrificing accuracy.
Findings
Using SAM and SAM 2 automates mask generation for unannotated frames.
Dataset frame variety impacts performance more than sheer quantity.
Annotation effort can be reduced significantly with minimal performance loss.
Abstract
Present-day deep neural networks for video semantic segmentation require a large number of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state-of-the-art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames as well as coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that if used…
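The abstract does not spell out how coarse annotations and automatically generated masks are combined. One plausible reading, offered here purely as an illustrative sketch and not as the authors' method, is to let each class-agnostic mask (of the kind SAM produces) inherit the majority label of the coarse annotation it overlaps, so that the mask's precise boundary replaces the annotator's rough one. The function name `refine_coarse_labels`, the `IGNORE` value and the `min_overlap` threshold are all hypothetical, and the masks are plain nested lists rather than real model output.

```python
from collections import Counter

IGNORE = 255  # hypothetical value marking pixels the coarse annotation leaves unlabelled


def refine_coarse_labels(masks, coarse, min_overlap=0.5):
    """Paint each class-agnostic mask with the majority coarse label it covers.

    masks:  list of 2D lists of bools, one per automatically generated mask
    coarse: 2D list of ints, IGNORE where no coarse label was drawn
    Returns a 2D list of labels; pixels covered by no accepted mask stay IGNORE.
    """
    height, width = len(coarse), len(coarse[0])
    refined = [[IGNORE] * width for _ in range(height)]
    for mask in masks:
        pixels = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]
        # Count only the labelled pixels that fall inside this mask.
        votes = Counter(coarse[r][c] for r, c in pixels if coarse[r][c] != IGNORE)
        if not votes:
            continue  # the mask never touches a labelled pixel
        label, count = votes.most_common(1)[0]
        if count / sum(votes.values()) < min_overlap:
            continue  # the labelled pixels inside the mask disagree too much
        # Later masks overwrite earlier ones where they overlap.
        for r, c in pixels:
            refined[r][c] = label
    return refined


if __name__ == "__main__":
    coarse = [
        [IGNORE, 1, IGNORE, IGNORE],
        [IGNORE, 1, IGNORE, 2],
        [IGNORE, IGNORE, IGNORE, 2],
    ]
    left = [[c < 2 for c in range(4)] for _ in range(3)]    # covers columns 0-1
    right = [[c >= 2 for c in range(4)] for _ in range(3)]  # covers columns 2-3
    for row in refine_coarse_labels([left, right], coarse):
        print(row)
```

In this toy case the two coarse strokes label only four pixels, yet every pixel ends up labelled because each mask spreads its majority label to its full extent. A real pipeline would need to decide how to handle overlapping masks and masks that straddle several classes, which this sketch resolves only in the simplest way.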
