Space-time Reinforcement Network for Video Object Segmentation

Yadang Chen; Wentao Zhu; Zhi-Xin Yang; Enhua Wu

arXiv:2405.04042·cs.CV·May 8, 2024

TL;DR

This paper introduces a space-time reinforcement network for video object segmentation that improves temporal coherence and matching accuracy by generating an auxiliary frame between adjacent frames and matching at the prototype level, reaching state-of-the-art accuracy on DAVIS 2017 at over 32 FPS.

Contribution

It proposes a novel approach combining auxiliary frame generation and prototype-level matching to improve VOS performance and robustness against challenging data and distractors.

Findings

01. Outperforms the state of the art on DAVIS 2017 with a J&F score of 86.4%

02. Achieves 85.0% on YouTube VOS 2018

03. Operates at over 32 FPS for real-time inference
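
For readers unfamiliar with the metric, J&F on DAVIS is the mean of region similarity J (the Jaccard index, i.e. intersection over union of predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below computes J on a toy flattened mask and averages it with a placeholder F value. The function names, the toy masks, and the F value of 0.7 are illustrative assumptions, not numbers from the paper, and the boundary F-measure itself is not implemented here.

```python
def region_similarity(pred, gt):
    """Jaccard index J: intersection over union of two binary masks.

    Masks are flat sequences of 0/1 values of equal length.
    Returning 1.0 when both masks are empty is an assumed convention.
    """
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


# Toy example: six pixels, flattened.
pred_mask = [1, 1, 1, 0, 0, 0]
gt_mask = [0, 1, 1, 1, 0, 0]

j = region_similarity(pred_mask, gt_mask)  # 2 shared pixels / 4 in the union
placeholder_f = 0.7                        # hypothetical contour accuracy
score = j_and_f(j, placeholder_f)
print(f"J = {j:.2f}, J&F = {score:.2f}")
```

In practice both J and F are averaged over all objects and frames of a sequence before being combined.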

Abstract

Recent video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Although these methods perform well, they suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames. 2) Pixel-level matching leads to undesired mismatches caused by noise or distractors. To address these issues, we first propose to generate an auxiliary frame between adjacent frames, which serves as an implicit short-temporal reference for the query frame. Next, we learn a prototype for each video object so that prototype-level matching can be performed between the query and memory. Experiments demonstrate that our network outperforms the state-of-the-art method on DAVIS 2017, achieving a J&F score of 86.4%, and attains a…
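
The abstract does not say how the prototypes are learned or matched, so the following is only a generic sketch of the idea of prototype-level matching, not the authors' method. It assumes one prototype per object obtained by masked average pooling over memory features, and assigns each query pixel to the prototype with the highest cosine similarity. All function names and the toy 2-D features are hypothetical.

```python
import math


def masked_average_prototype(features, mask):
    """Average the feature vectors of the pixels where mask == 1."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_to_prototypes(query_features, prototypes):
    """Label each query pixel with the index of its most similar prototype."""
    labels = []
    for q in query_features:
        scores = [cosine(q, p) for p in prototypes]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy memory frame: four pixels with 2-D features and two object masks.
memory_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
object_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
prototypes = [masked_average_prototype(memory_features, m) for m in object_masks]

# Toy query frame: three pixels to be assigned to an object.
query_features = [[0.8, 0.2], [0.2, 0.8], [1.0, 0.1]]
labels = match_to_prototypes(query_features, prototypes)
print(labels)
```

Because every query pixel is compared against a handful of object-level vectors rather than against every memory pixel, a single noisy memory pixel has less influence on the result, which is the motivation the abstract gives for moving away from pixel-level matching.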

Taxonomy

Topics: Advanced Vision and Imaging · Visual Attention and Saliency Detection · Advanced Image Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · VOS