Video OWL-ViT: Temporally-consistent open-world localization in video

Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf

arXiv:2308.11093 · cs.CV · August 23, 2023

TL;DR

This paper introduces Video OWL-ViT, a model that adapts pre-trained open-world image models to temporally consistent object localization in video. A transformer decoder tracks objects by propagating their representations recurrently from frame to frame.

Contribution

The paper presents an architecture and training recipe that extend OWL-ViT to video, enabling end-to-end training and improved temporal consistency in open-world localization.

Findings

01. Achieves better temporal consistency than tracking-by-detection baselines.

02. Successfully transfers open-world capabilities from large-scale image-text pre-training.

03. Demonstrates strong performance on the TAO-OW benchmark.

Abstract

We present an architecture and a training recipe that adapt pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end…
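The recurrence described in the abstract (output tokens for one frame become the object queries for the next) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single-head cross-attention with a residual connection and no learned weights, and it stands in random vectors for the image-encoder tokens. The names `cross_attend` and `track_video` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frame_tokens):
    """Toy decoder step (assumption: single head, no learned projections).

    Each object query attends over the frame's tokens and adds the
    attended context to itself (residual connection).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in frame_tokens]
        weights = softmax(scores)
        context = [
            sum(w * k[d] for w, k in zip(weights, frame_tokens)) for d in range(dim)
        ]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def track_video(frames, initial_queries):
    """Apply the decoder frame by frame, feeding outputs back in as queries."""
    queries = initial_queries
    per_frame_outputs = []
    for frame_tokens in frames:
        outputs = cross_attend(queries, frame_tokens)
        per_frame_outputs.append(outputs)
        # Recurrence: output slot i at frame t is query slot i at frame t+1,
        # so a slot index can serve as a track identity across frames.
        queries = outputs
    return per_frame_outputs


# Random stand-ins for encoder tokens (assumption: 3 frames, 5 tokens, dim 4).
rng = random.Random(0)
num_frames, num_tokens, num_queries, dim = 3, 5, 2, 4
frames = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_frames)
]
initial_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
outputs = track_video(frames, initial_queries)
```

In a full model, each output token would additionally pass through box and classification heads, and the decoder would include learned projections, self-attention, and feed-forward layers. The sketch keeps only the query hand-off between frames, which is what ties an object slot to the same object over time.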

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications