Scalable Video Object Segmentation with Simplified Framework
Qiangqiang Wu, Tianyu Yang, Wei Wu, Antoni Chan

TL;DR
This paper introduces SimVOS, a scalable and simplified framework for video object segmentation that uses a single transformer backbone for joint feature extraction and matching, achieving state-of-the-art results.
Contribution
The paper proposes a unified transformer-based VOS framework that simplifies design and leverages pre-trained ViT models, improving performance and efficiency.
Findings
Achieves state-of-the-art results on DAVIS and YouTube-VOS benchmarks.
Effectively utilizes pre-trained ViT backbones like MAE for VOS.
Introduces token refinement for faster inference.
Abstract
The current popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that separately perform feature extraction and matching. However, the above hand-crafted designs empirically cause insufficient target interaction, thus limiting the dynamic target-aware feature learning in VOS. To tackle these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework to perform joint feature extraction and matching by leveraging a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features. This design enables SimVOS to learn better target-aware features for accurate mask prediction. More importantly, SimVOS could directly apply well-pretrained ViT backbones (e.g., MAE) for VOS, which bridges the gap between VOS and…
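The idea of joint feature extraction and matching can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single-head attention layer with a residual connection, and the names `joint_attention` and `mask_embed`, the random weights, and the way the mask is added to the reference tokens are all hypothetical choices made for illustration. The point it shows is that once reference and query tokens are concatenated, one self-attention pass lets every query token attend to reference tokens while its own features are still being computed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(ref_tokens, query_tokens, w_q, w_k, w_v):
    """One single-head self-attention layer over concatenated tokens.

    Illustrative only: a real ViT block also has multiple heads,
    layer normalization, and an MLP.
    """
    tokens = np.concatenate([ref_tokens, query_tokens], axis=0)  # (Nr+Nq, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))               # (Nr+Nq, Nr+Nq)
    out = tokens + attn @ v                                      # residual connection
    n_ref = ref_tokens.shape[0]
    return out[:n_ref], out[n_ref:], attn

rng = np.random.default_rng(0)
dim, n_ref, n_query = 16, 6, 4

# Reference frame: patch embeddings plus a mask signal (assumed scheme).
ref_patches = rng.normal(size=(n_ref, dim))
ref_mask = rng.integers(0, 2, size=(n_ref, 1)).astype(float)  # 1 = target patch
mask_embed = rng.normal(size=(1, dim))                        # hypothetical embedding
ref_tokens = ref_patches + ref_mask * mask_embed

# Query frame: patch embeddings only; its mask is what a decoder would predict.
query_tokens = rng.normal(size=(n_query, dim))

w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
ref_out, query_out, attn = joint_attention(ref_tokens, query_tokens, w_q, w_k, w_v)

# Block of the attention map where query tokens look at reference tokens.
query_to_ref = attn[n_ref:, :n_ref]
```

In this sketch `query_to_ref` plays the role that a separate matching module plays in pipelines with distinct extraction and matching stages; here it arises inside the same layer that updates the query features.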
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: VOS · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
