Space-time Reinforcement Network for Video Object Segmentation
Yadang Chen, Wentao Zhu, Zhi-Xin Yang, Enhua Wu

TL;DR
This paper introduces a space-time reinforcement network for video object segmentation. It improves temporal coherence by generating an auxiliary frame between adjacent frames, and improves matching accuracy by matching at the prototype level, reaching state-of-the-art accuracy at over 32 FPS.
Contribution
It combines auxiliary-frame generation, which gives the query frame an implicit short-term temporal reference, with prototype-level matching between query and memory, making VOS more robust to challenging data and to noise or distractors.
Findings
Outperforms the previous state of the art on DAVIS 2017 with a J&F score of 86.4%
Achieves 85.0% on YouTube-VOS 2018
Operates at over 32 FPS for real-time inference
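For context, J&F on DAVIS is conventionally the mean of region similarity J (the Jaccard index, or IoU, between predicted and ground-truth masks) and boundary accuracy F (a boundary F-measure). The following is a minimal sketch of J and the final averaging only, assuming binary masks given as nested lists. The function names are illustrative, the empty-mask convention is an assumption, and F is not implemented here.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """J&F is the arithmetic mean of the region (J) and boundary (F) scores."""
    return (j_score + f_score) / 2.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)
```

Benchmark scores such as those listed above are averages of these per-object, per-frame values over a whole dataset; the toy masks here only illustrate the per-frame computation.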
Abstract
Recent video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Despite their superior performance, these methods suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames. 2) Pixel-level matching leads to undesired mismatches caused by noise or distractors. To address these issues, we first propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query frame. Next, we learn a prototype for each video object, so that prototype-level matching can be performed between the query and memory. Experiments demonstrate that our network outperforms the state-of-the-art method on DAVIS 2017, achieving a J&F score of 86.4%, and attains a…
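The contrast between pixel-level and prototype-level matching can be sketched as follows. This is a rough illustration, not the paper's formulation: the pixel-level readout follows the softmax-affinity pattern common in memory-based VOS, and the prototype here is built by masked average pooling, which is an assumption (the paper learns its prototypes, and the abstract gives no details). All function names, shapes, and the random toy features are hypothetical.

```python
import numpy as np


def pixel_level_readout(query_key, memory_key, memory_value):
    """Soft-match every query pixel against every memory pixel.

    query_key: (Nq, C), memory_key: (Nm, C), memory_value: (Nm, D).
    Returns (Nq, D) features read out of the memory.
    """
    scale = np.sqrt(query_key.shape[1])
    logits = query_key @ memory_key.T / scale        # (Nq, Nm) affinities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over memory pixels
    return weights @ memory_value


def masked_average_prototype(memory_key, object_mask):
    """Pool memory features inside the object mask into one (C,) prototype."""
    mask = object_mask.astype(memory_key.dtype)[:, None]   # (Nm, 1)
    return (memory_key * mask).sum(axis=0) / max(mask.sum(), 1.0)


def prototype_level_similarity(query_key, prototype, eps=1e-8):
    """Cosine similarity between each query pixel and the object prototype."""
    q = query_key / (np.linalg.norm(query_key, axis=1, keepdims=True) + eps)
    p = prototype / (np.linalg.norm(prototype) + eps)
    return q @ p                                     # (Nq,)


# Toy features: 6 memory pixels, 5 query pixels, 4 key channels, 3 value channels.
rng = np.random.default_rng(0)
memory_key = rng.normal(size=(6, 4))
memory_value = rng.normal(size=(6, 3))
query_key = rng.normal(size=(5, 4))
object_mask = np.array([1, 1, 0, 0, 1, 0])           # memory pixels on the object

readout = pixel_level_readout(query_key, memory_key, memory_value)
prototype = masked_average_prototype(memory_key, object_mask)
similarity = prototype_level_similarity(query_key, prototype)
```

In the pixel-level path, a single noisy memory pixel with high affinity can dominate a query pixel's readout. In the prototype path, each query pixel is compared with one pooled descriptor per object, so individual outlier pixels are averaged out, which is the intuition behind the second issue the abstract raises.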
Taxonomy
Topics: Advanced Vision and Imaging · Visual Attention and Saliency Detection · Advanced Image Processing Techniques
Methods: SPEED (Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings) · VOS
