Siamese Network with Interactive Transformer for Video Object Segmentation
Meng Lan, Jing Zhang, Fengxiang He, Lefei Zhang

TL;DR
This paper introduces SITVOS, a Siamese network with an interactive transformer that propagates spatio-temporal context from past frames to the current frame for semi-supervised video object segmentation, outperforming state-of-the-art methods on three benchmark datasets.
Contribution
The paper presents a novel Siamese network with an interactive transformer and feature interaction module for improved context propagation in VOS.
Findings
Outperforms state-of-the-art methods on three benchmarks.
Efficient feature reuse via Siamese architecture.
Effective spatio-temporal context encoding with transformer.
Abstract
Semi-supervised video object segmentation (VOS) refers to segmenting the target object in remaining frames given its annotation in the first frame, which has been actively studied in recent years. The key challenge lies in finding effective ways to exploit the spatio-temporal context of past frames to help learn discriminative target representation of current frame. In this paper, we propose a novel Siamese network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames. Technically, we use the transformer encoder and decoder to handle the past frames and current frame separately, i.e., the encoder encodes robust spatio-temporal context of target object from the past frames, while the decoder takes the feature embedding of current frame as the query to retrieve the target from the encoder output. To…
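The encoder/decoder split described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses single-head attention, omits layer normalization, feed-forward layers, positional encodings, object masks, and the feature interaction module, and all names (`shared_embed`, `propagate_context`, `W`) and tensor sizes are hypothetical choices made for illustration. It shows only the general idea that past-frame tokens are encoded with self-attention and that the current-frame embedding acts as the query in a cross-attention step over the encoder output, with one shared set of weights embedding both branches.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    scale = np.sqrt(query.shape[-1])
    weights = softmax(query @ key.T / scale, axis=-1)
    return weights @ value, weights

def shared_embed(features, W):
    """Siamese-style embedding: the same weights W are used for every frame."""
    return features @ W

def propagate_context(past_frames, current_frame, W):
    # past_frames: (T, N, C) = T past frames, N spatial tokens, C channels
    # current_frame: (N, C)
    T, N, C = past_frames.shape
    memory = shared_embed(past_frames.reshape(T * N, C), W)  # (T*N, D)
    query = shared_embed(current_frame, W)                   # (N, D)
    # Encoder step (assumed form): self-attention over all past-frame tokens.
    context, _ = attention(memory, memory, memory)
    # Decoder step (assumed form): current-frame tokens query the encoder output.
    retrieved, weights = attention(query, context, context)
    return retrieved, weights

rng = np.random.default_rng(0)
past = rng.normal(size=(3, 16, 8))   # 3 past frames, 16 tokens, 8 channels
current = rng.normal(size=(16, 8))
W = rng.normal(size=(8, 4))          # hypothetical shared projection
retrieved, weights = propagate_context(past, current, W)
```

Each row of `weights` is a distribution over the 48 past-frame tokens, so `retrieved` is a per-token weighted summary of the encoded past context. A real model would feed that summary into a segmentation head, which is outside the scope of this sketch.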
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Siamese Network
