Unsupervised Open-Vocabulary Object Localization in Videos

Ke Fan; Zechen Bai; Tianjun Xiao; Dominik Zietlow; Max Horn; Zixu Zhao; Carl-Johann Simon-Gabriel; Mike Zheng Shou; Francesco Locatello; Bernt Schiele; Thomas Brox; Zheng Zhang; Yanwei Fu; Tong He

arXiv:2309.09858 · cs.CV · June 27, 2024 · 1 citation

TL;DR

This paper introduces an unsupervised method for localizing objects in videos: slot attention first finds object regions, and the pre-trained CLIP vision-language model then assigns text to them. No manual annotation is used beyond what CLIP implicitly contains.

Contribution

It presents a novel unsupervised approach combining slot attention and CLIP for video object localization, eliminating the need for manual labels.

Findings

01 Localizes objects effectively on regular video benchmarks

02 Reads localized semantic information from the pre-trained CLIP model to assign text to slots

03 Described by the authors as effectively the first unsupervised approach to yield good results on these benchmarks

Abstract

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
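The two stages named in the abstract (localize with slot attention, then assign text to slots) can be illustrated with a toy sketch. This is not the authors' implementation: the slot-attention step below omits the learned projections and recurrent update of the full method, the text assignment uses plain cosine matching as a stand-in for the paper's way of reading localized semantics from CLIP, and all vectors, names, and labels (`features`, `text_embeddings`, "dog", "ball") are hypothetical values chosen for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def slot_attention_step(slots, features):
    """One simplified slot-attention iteration (assumption: no learned
    projections and no GRU update, unlike the full method).

    Attention is normalized across slots for each feature, so slots
    compete for features; each slot then becomes the attention-weighted
    mean of the features it claimed.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[i][k]: share of feature i claimed by slot k
    attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
    new_slots = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        total = sum(weights) + 1e-8
        new_slots.append(
            [sum(w * f[d] for w, f in zip(weights, features)) / total
             for d in range(dim)]
        )
    return new_slots, attn


def assign_labels(slots, text_embeddings):
    """Give each slot the label whose embedding is most cosine-similar."""
    return [
        max(text_embeddings, key=lambda name: cosine(s, text_embeddings[name]))
        for s in slots
    ]


# Hypothetical 2-D patch features forming two groups.
features = [[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]]
slots = [[1.0, 0.2], [0.2, 1.0]]  # hypothetical initial slots
for _ in range(3):
    slots, attn = slot_attention_step(slots, features)

# Hypothetical text embeddings; in practice these would come from a
# text encoder that shares an embedding space with the visual features.
text_embeddings = {"dog": [1.0, 0.0], "ball": [0.0, 1.0]}
labels = assign_labels(slots, text_embeddings)
print(labels)
```

In this toy setting each slot settles near the mean of one feature group, and the attention rows double as a soft localization mask: the slot with the highest weight for a feature is the region that feature belongs to.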

Code & Models

Repositories

amazon-science/object-centric-vol · PyTorch · Official

Videos

Unsupervised Open-Vocabulary Object Localization in Videos · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Contrastive Language-Image Pre-training