Video Object of Interest Segmentation
Siyuan Zhou, Chunru Zhan, Biao Wang, Tiezheng Ge, Yuning Jiang, Li Niu

TL;DR
This paper introduces a new task called video object of interest segmentation (VOIS), which combines segmentation and tracking of relevant objects in videos based on a target image, supported by a new dataset and a transformer-based method.
Contribution
The paper proposes the VOIS task, constructs the LiveVideos dataset, and develops a transformer-based approach that simultaneously segments and tracks objects of interest.
Findings
The proposed method outperforms existing approaches on the LiveVideos dataset.
The dual-path transformer effectively fuses video and image features.
Extensive experiments validate the superiority of the proposed approach.
Abstract
In this work, we present a new computer vision task named video object of interest segmentation (VOIS). Given a video and a target image of interest, our objective is to simultaneously segment and track all objects in the video that are relevant to the target image. This problem combines the traditional video object segmentation task with an additional image indicating the content that users are concerned with. Since no existing dataset is perfectly suitable for this new task, we specifically construct a large-scale dataset called LiveVideos, which contains 2418 pairs of target images and live videos with instance-level annotations. In addition, we propose a transformer-based method for this task. We revisit Swin Transformer and design a dual-path structure to fuse video and image features. Then, a transformer decoder is employed to generate object proposals for segmentation and…
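The fusion idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the dual-path structure exchanges information, so the code below assumes a plain single-head cross-attention with a residual connection, no learned projections, and no Swin windowing. The names `cross_attend` and `dual_path_fuse` are hypothetical and exist only for this illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head cross-attention with a residual connection.

    Assumption for illustration: no learned projections, so keys and
    values are the same token vectors.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Scaled dot-product score between this query and every key token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Attention-weighted average of the value tokens.
        context = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        # Residual connection: keep the original token and add the context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


def dual_path_fuse(video_tokens, image_tokens):
    """Hypothetical two-way fusion: each path attends to the other."""
    video_out = cross_attend(video_tokens, image_tokens)
    image_out = cross_attend(image_tokens, video_tokens)
    return video_out, image_out


# Toy inputs: three video tokens and two target-image tokens of dimension 2.
video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_tokens = [[1.0, 0.0], [0.5, 0.5]]
video_out, image_out = dual_path_fuse(video_tokens, image_tokens)
```

Each path keeps its own token count and feature dimension, so the fused video tokens could be passed on to a decoder in the same shape they arrived in. A real model would add learned query, key, and value projections, multiple heads, and normalization layers.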
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Layer Normalization · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Stochastic Depth · Linear Layer
