Look Before You Match: Instance Understanding Matters in Video Object Segmentation
Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Chuanxin Tang, Xiyang Dai, Yucheng Zhao, Yujia Xie, Lu Yuan, Yu-Gang Jiang

TL;DR
This paper introduces a two-branch network that combines instance understanding with memory-based matching to improve video object segmentation, achieving state-of-the-art results on multiple benchmarks.
Contribution
It proposes a novel two-branch VOS framework integrating instance segmentation with memory matching, enhancing robustness to appearance and viewpoint changes.
Findings
Achieves state-of-the-art performance on DAVIS and YouTube-VOS benchmarks.
Outperforms existing methods by significant margins.
Effectively combines high-resolution instance features with memory readout.
Abstract
Exploring dense matching between the current frame and past frames for long-range context modeling, memory-based methods have demonstrated impressive results in video object segmentation (VOS) recently. Nevertheless, due to the lack of instance understanding ability, the above approaches are oftentimes brittle to large appearance variations or viewpoint changes resulting from the movement of objects and cameras. In this paper, we argue that instance understanding matters in VOS, and integrating it with memory-based matching can enjoy the synergy, which is intuitively sensible from the definition of VOS task, i.e., identifying and segmenting object instances within the video. Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal…
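The dense matching that the abstract attributes to memory-based methods can be illustrated with a minimal sketch of a generic attention-style memory readout. This is not the paper's implementation: the function name `memory_readout`, the flattened array shapes, and the scaled dot-product affinity are assumptions chosen for illustration, and the instance segmentation branch is not modeled here.

```python
import numpy as np


def memory_readout(query_key, memory_key, memory_value):
    """Generic memory readout via dense matching (illustrative sketch only).

    query_key:    (C_k, N_q) key features for the pixels of the current frame
    memory_key:   (C_k, N_m) key features for the pixels of past (memory) frames
    memory_value: (C_v, N_m) value features stored with the memory frames
    returns:      (C_v, N_q) value features aggregated for each query pixel
    """
    c_k = query_key.shape[0]
    # Affinity between every memory pixel and every query pixel: (N_m, N_q).
    # Scaled dot product is an assumption; other similarity measures exist.
    affinity = memory_key.T @ query_key / np.sqrt(c_k)
    # Softmax over the memory axis, with max-subtraction for numerical stability.
    affinity = affinity - affinity.max(axis=0, keepdims=True)
    weights = np.exp(affinity)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Each query pixel receives a weighted sum of the memory values.
    return memory_value @ weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 5))    # 5 query pixels, 8-dim keys
    k = rng.normal(size=(8, 12))   # 12 memory pixels
    v = rng.normal(size=(3, 12))   # 3-dim values
    print(memory_readout(q, k, v).shape)
```

Because the weights for each query pixel sum to one, the readout is a convex combination of memory values, which is why matching alone can drift when an object's appearance changes sharply between frames.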
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: VOS
