TL;DR
This paper presents a method to generate accurate video object segmentation masks from bounding box annotations by exploiting spatio-temporal consistencies, enabling weakly supervised training and improving generalization in video segmentation and tracking.
Contribution
It introduces a spatio-temporal aggregation module to mine consistencies across frames, allowing mask generation from bounding boxes for large-scale weakly supervised training.
Findings
Achieves state-of-the-art results in video object segmentation.
Improves generalization in tracking tasks.
Enables large-scale mask generation from bounding box annotations.
Abstract
Segmenting objects in videos is a fundamental computer vision task. The current deep-learning-based paradigm offers a powerful but data-hungry solution. However, current datasets are limited by the cost and human effort of annotating object masks in videos. This effectively limits the performance and generalization capabilities of existing video segmentation methods. To address this issue, we explore a weaker form of annotation: bounding boxes. We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos. To this end, we propose a spatio-temporal aggregation module that effectively mines consistencies in the object and background appearance across multiple frames. We use the resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks. We generate segmentation masks for large-scale tracking datasets,…
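To make the idea of mining object and background appearance across frames concrete, here is a minimal toy sketch. It is not the paper's learned spatio-temporal aggregation module: it swaps in a simple colour-histogram likelihood test, and the function names (`quantize`, `masks_from_boxes`), the bin count, and the box format are all assumptions made for illustration. Colour statistics are pooled over every frame, once for pixels inside the boxes and once for pixels outside them, and a pixel inside a box is kept as foreground only if its colour is more typical of box interiors than of the background.

```python
import numpy as np

def quantize(frames, bins=8):
    """Map uint8 RGB pixels to one colour-bin index per pixel."""
    q = (frames.astype(np.int64) * bins) // 256  # each channel in [0, bins)
    return q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]

def masks_from_boxes(frames, boxes, bins=8, eps=1.0):
    """Toy box-to-mask conversion (hypothetical helper, not the paper's module).

    frames: (T, H, W, 3) uint8 video.
    boxes:  (T, 4) integer boxes as x0, y0, x1, y1 with exclusive upper bounds.
    Returns a (T, H, W) boolean mask.
    """
    T, H, W, _ = frames.shape
    idx = quantize(frames, bins)

    # Rasterize the per-frame boxes.
    inside = np.zeros((T, H, W), dtype=bool)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        inside[t, y0:y1, x0:x1] = True

    # Pool colour counts over ALL frames (the temporal aggregation step),
    # with additive smoothing so unseen colours get a small probability.
    n = bins ** 3
    in_hist = np.bincount(idx[inside], minlength=n) + eps
    out_hist = np.bincount(idx[~inside], minlength=n) + eps
    in_hist = in_hist / in_hist.sum()
    out_hist = out_hist / out_hist.sum()

    # Foreground: inside the box AND colour more likely under the box model.
    return inside & (in_hist[idx] > out_hist[idx])

# Synthetic example: a red 4x4 object drifting over a blue background,
# annotated with loose 8x8 boxes that also contain background pixels.
T, H, W = 3, 16, 16
frames = np.zeros((T, H, W, 3), dtype=np.uint8)
frames[..., 2] = 255
boxes = np.array([[2, 2, 10, 10], [4, 3, 12, 11], [6, 4, 14, 12]])
truth = np.zeros((T, H, W), dtype=bool)
for t, (x0, y0, x1, y1) in enumerate(boxes):
    truth[t, y0 + 2:y1 - 2, x0 + 2:x1 - 2] = True
frames[truth] = (255, 0, 0)

masks = masks_from_boxes(frames, boxes)
```

In this toy setting the background colour appears both inside and outside the boxes, so the pooled background model explains it better and those pixels are rejected. Real videos need far richer appearance features, which is the role a learned module would play.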
Taxonomy
Methods: VOS
