Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation
Shuangrui Ding, Rui Qian, Haohang Xu, Dahua Lin, Hongkai Xiong

TL;DR
This paper introduces a simple, effective self-supervised video object segmentation method leveraging DINO-pretrained Transformers' structural dependencies, achieving state-of-the-art results without auxiliary modalities or complex attention mechanisms.
Contribution
The authors propose a novel approach that uses a single spatio-temporal Transformer and hierarchical clustering on DINO features for self-supervised VOS, simplifying previous methods.
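The clustering step named above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes cosine similarity, average linkage, and a fixed target number of clusters, none of which this summary specifies. The names `agglomerative_cluster` and `toy_features` are hypothetical, and the two-dimensional vectors stand in for real DINO patch features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def agglomerative_cluster(features, num_clusters):
    """Bottom-up clustering (assumed: average linkage on cosine similarity).

    Starts with one cluster per feature and repeatedly merges the two
    most similar clusters until only num_clusters remain.
    Returns one integer label per input feature.
    """
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > num_clusters:
        best = None  # (average similarity, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [
                    cosine(features[p], features[q])
                    for p in clusters[i]
                    for q in clusters[j]
                ]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy stand-ins for patch features: two patches per "object".
toy_features = [
    [1.0, 0.1], [0.9, 0.2],  # patches resembling object A
    [0.1, 1.0], [0.2, 0.9],  # patches resembling object B
]
labels = agglomerative_cluster(toy_features, num_clusters=2)
```

In this toy case the two patches of each object end up sharing a label. A practical pipeline would more likely use an optimized library routine and a stopping criterion suited to the data rather than a hand-picked cluster count.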
Findings
Achieves state-of-the-art results on multiple benchmarks.
Excels in complex multi-object video segmentation tasks.
Operates without auxiliary modalities or iterative slot attention.
Abstract
In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Previous self-supervised VOS techniques mostly rely on auxiliary modalities or iterative slot attention to assist in object discovery, which restricts their general applicability and imposes higher computational requirements. To address these challenges, we develop a simplified architecture that capitalizes on the emerging objectness from DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Specifically, we first introduce a single…
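The idea of a spatio-temporal correspondence cue can be sketched as a softmax-normalized affinity between patch tokens pooled from all frames. This is an illustrative assumption, not the paper's architecture: the function names, the temperature value, and the toy two-dimensional features below are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def softmax(row, temperature):
    """Numerically stable softmax with a temperature."""
    m = max(row)
    exps = [math.exp((v - m) / temperature) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatiotemporal_affinity(frames, temperature=0.1):
    """frames: list of frames, each a list of patch feature vectors.

    Flattens all patches of all frames into one token sequence and
    returns (index, affinity): index[i] is the (frame, patch) pair of
    token i, and affinity[i][j] is the softmax-normalized cosine
    similarity from token i to token j.
    """
    tokens = [
        (t, p, normalize(feat))
        for t, frame in enumerate(frames)
        for p, feat in enumerate(frame)
    ]
    affinity = []
    for _, _, query in tokens:
        sims = [sum(a * b for a, b in zip(query, key)) for _, _, key in tokens]
        affinity.append(softmax(sims, temperature))
    index = [(t, p) for t, p, _ in tokens]
    return index, affinity


def best_match(index, affinity, query, target_frame):
    """Return the (frame, patch) in target_frame with the highest affinity to query."""
    row = affinity[index.index(query)]
    candidates = [(row[j], index[j]) for j in range(len(index)) if index[j][0] == target_frame]
    return max(candidates)[1]


# Toy video: the "object" sits in patch 0 of frame 0 and patch 1 of frame 1.
toy_frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 1.0], [1.0, 0.1]],
]
index, affinity = spatiotemporal_affinity(toy_frames)
match = best_match(index, affinity, query=(0, 0), target_frame=1)
```

Here the token at frame 0, patch 0 matches frame 1, patch 1, which is where the similar feature moved. Rows of such an affinity matrix could then serve as the per-token descriptors that a clustering step groups into objects.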
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Vision Transformer · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings
