In-N-Out Generative Learning for Dense Unsupervised Video Segmentation
Xiao Pan, Peike Li, Zongxin Yang, Huiling Zhou, Chang Zhou, Hongxia Yang, Jingren Zhou, Yi Yang

TL;DR
This paper introduces In-aNd-Out (INO) generative learning, a unified framework that combines image-level and pixel-level optimization for unsupervised video object segmentation (VOS) with Vision Transformers, reporting state-of-the-art results on DAVIS-2017 and YouTube-VOS.
Contribution
It proposes INO generative learning, which unifies image-level and pixel-level optimization in a single framework for VOS.
Findings
Outperforms previous state-of-the-art methods on DAVIS-2017 and YouTube-VOS datasets.
Effectively captures high-level semantics and fine-grained details.
Enhances temporal consistency in video segmentation.
Abstract
In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm, which optimizes at either the image level or the pixel level. Image-level optimization (e.g., the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal since the pixel-level features are optimized implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of training data and is not robust to object deformation. To complementarily perform these two levels of optimization in a unified framework, we propose the In-aNd-Out (INO) generative learning from a purely generative perspective with the help of naturally designed class tokens and patch…
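To make "visual correspondence" concrete, the following is a minimal sketch of how pixel-level feature similarity is commonly used to carry a segmentation mask from a reference frame to a target frame. It is not the INO method itself: the function names `cosine_affinity` and `propagate_labels`, the temperature of 0.07, and the toy two-dimensional features are all assumptions made for illustration.

```python
import math


def cosine_affinity(query_feats, ref_feats, temperature=0.07):
    """Softmax-normalized cosine similarity between two sets of feature vectors.

    query_feats: N feature vectors (one per pixel/patch of the target frame)
    ref_feats:   M feature vectors (one per pixel/patch of the reference frame)
    Returns an N x M matrix whose rows each sum to 1.
    The temperature value is an illustrative assumption, not taken from the paper.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    queries = [normalize(v) for v in query_feats]
    refs = [normalize(v) for v in ref_feats]

    affinity = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, r)) / temperature for r in refs]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        affinity.append([e / total for e in exps])
    return affinity


def propagate_labels(affinity, ref_labels):
    """Give each target location the affinity-weighted average of reference labels."""
    num_classes = len(ref_labels[0])
    return [
        [sum(w * label[c] for w, label in zip(row, ref_labels))
         for c in range(num_classes)]
        for row in affinity
    ]


# Toy example: two reference locations with one-hot labels (object, background).
ref_feats = [[1.0, 0.0], [0.0, 1.0]]
ref_labels = [[1.0, 0.0], [0.0, 1.0]]
query_feats = [[0.9, 0.1], [0.2, 0.8]]

affinity = cosine_affinity(query_feats, ref_feats)
soft_masks = propagate_labels(affinity, ref_labels)
predicted = [max(range(len(p)), key=p.__getitem__) for p in soft_masks]
```

In this toy setup, each target location takes the label of the reference location whose feature it most resembles, so the quality of the propagated mask depends directly on how discriminative the learned pixel-level features are.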
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Contrastive Learning · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Label Smoothing · Multi-Head Attention
