TL;DR
This paper introduces a deep model that uses temporal information from video frames to improve semi-supervised video segmentation, reducing the need for extensive labeled data.
Contribution
It proposes a novel end-to-end trainable architecture that effectively propagates temporal information within the network, outperforming baseline methods on CityScapes.
Findings
Significant performance improvement over frame-by-frame segmentation.
Effective use of unlabeled frames through temporal information.
Temporal guidance within the network enhances segmentation accuracy.
Abstract
In recent years, there has been remarkable progress in supervised image segmentation. Video segmentation is less explored, despite the temporal dimension being highly informative. For example, semantic labels that cannot be accurately detected in the current frame may be inferred by incorporating information from previous frames. However, video segmentation is challenging due to the amount of data that needs to be processed and, more importantly, the cost involved in obtaining ground truth annotations for each frame. In this paper, we tackle the issue of label scarcity by using consecutive frames of a video, where only one frame is annotated. We propose a deep, end-to-end trainable model which leverages temporal information in order to make use of easy-to-acquire unlabeled data. Our network architecture relies on a novel interconnection of two components: a fully convolutional network to…
