Scalable Video Object Segmentation with Simplified Framework

Qiangqiang Wu; Tianyu Yang; Wei WU; Antoni Chan

arXiv:2308.09903 · cs.CV · August 22, 2023 · 1 cites

TL;DR

This paper introduces SimVOS, a scalable, simplified framework for video object segmentation that uses a single transformer backbone for joint feature extraction and matching, achieving state-of-the-art results on the DAVIS and YouTube-VOS benchmarks.

Contribution

The paper proposes a unified transformer-based VOS framework that simplifies design and leverages pre-trained ViT models, improving performance and efficiency.

Findings

01. Achieves state-of-the-art results on the DAVIS and YouTube-VOS benchmarks.

02. Effectively utilizes pre-trained ViT backbones such as MAE for VOS.

03. Introduces token refinement for faster inference.

Abstract

Currently popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that perform feature extraction and matching separately. However, these hand-crafted designs empirically cause insufficient target interaction, which limits dynamic target-aware feature learning in VOS. To address these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework that performs joint feature extraction and matching with a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features. This design enables SimVOS to learn better target-aware features for accurate mask prediction. More importantly, SimVOS can directly apply well-pretrained ViT backbones (e.g., MAE) to VOS, which bridges the gap between VOS and…
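The idea of joint feature extraction and matching can be illustrated with a toy sketch. This is not the SimVOS implementation: it assumes single-head self-attention with identity query/key/value projections, omits layer normalization and MLP blocks, and uses a hypothetical `mask_embed` vector to mark reference tokens that belong to the target. The function names and parameters are illustrative only. The point it shows is that reference and query tokens are concatenated and processed by the same attention layers, so query features become target-aware without a separate matching module.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def joint_extract_and_match(ref_tokens, ref_mask, query_tokens, mask_embed,
                            num_layers=2):
    """Run reference and query tokens through shared attention layers.

    ref_mask holds 1 for reference tokens on the target and 0 otherwise.
    mask_embed is a hypothetical embedding added to target tokens.
    """
    if len(ref_tokens) != len(ref_mask):
        raise ValueError("ref_tokens and ref_mask must have the same length")
    # Inject target information into the reference tokens.
    ref_in = [[x + m * e for x, e in zip(tok, mask_embed)]
              for tok, m in zip(ref_tokens, ref_mask)]
    # One token sequence: every layer extracts features and matches jointly.
    tokens = ref_in + [list(t) for t in query_tokens]
    for _ in range(num_layers):
        attended = self_attention(tokens)
        # Residual connection around the attention output.
        tokens = [[x + a for x, a in zip(t, at)]
                  for t, at in zip(tokens, attended)]
    # Only the query tokens would be passed on to a mask decoder.
    return tokens[len(ref_in):]


ref = [[1.0, 0.0], [0.0, 1.0]]
query = [[1.0, 1.0]]
features = joint_extract_and_match(ref, [1, 0], query, [0.5, -0.5])
print(features)
```

Changing `ref_mask` changes the returned query features, which is the sense in which the shared backbone makes them depend on the target. A real model would use learned projections and a pre-trained ViT in place of these toy layers.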

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Methods: VOS · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings