Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Shuangrui Ding; Rui Qian; Haohang Xu; Dahua Lin; Hongkai Xiong

arXiv:2311.17893 · cs.CV · July 9, 2024 · 1 citation

TL;DR

This paper introduces a simple, effective self-supervised video object segmentation method leveraging DINO-pretrained Transformers' structural dependencies, achieving state-of-the-art results without auxiliary modalities or complex attention mechanisms.

Contribution

The authors propose a novel approach that uses a single spatio-temporal Transformer and hierarchical clustering on DINO features for self-supervised VOS, simplifying previous methods.
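To make the clustering step concrete, here is a minimal sketch of hierarchical (agglomerative) clustering over per-patch feature vectors. It is not the authors' implementation: the toy vectors stand in for DINO patch features, the function names `cosine_distance` and `agglomerative_cluster` are hypothetical, and the choice of cosine distance with average linkage is an assumption made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cluster(features, num_clusters):
    """Average-linkage agglomerative clustering (illustrative assumption).

    Starts with one cluster per feature and repeatedly merges the two
    clusters with the smallest mean pairwise cosine distance until
    `num_clusters` remain. Returns one integer label per feature.
    """
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(
                    cosine_distance(features[p], features[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                mean = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or mean < best[0]:
                    best = (mean, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected

    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy stand-ins for patch features: two patches per "object".
toy_features = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
labels = agglomerative_cluster(toy_features, num_clusters=2)
print(labels)
```

In this toy case the first two vectors end up in one cluster and the last two in another. A real pipeline would run on thousands of patch tokens per clip, where this quadratic pure-Python loop would be far too slow; the sketch only shows the merge logic.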

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Excels in complex multi-object video segmentation tasks.

03. Operates without auxiliary modalities or iterative slot attention.

Abstract

In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Previous self-supervised VOS techniques mostly rely on auxiliary modalities or iterative slot attention to assist in object discovery, which restricts their general applicability and imposes higher computational requirements. To address these challenges, we develop a simplified architecture that capitalizes on the emerging objectness from DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Specifically, we first introduce a single…

Code & Models

Repositories

shvdiwnkozbw/ssl-uvos
PyTorch · Official

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Methods: Multi-Head Attention · Vision Transformer · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings