Video OWL-ViT: Temporally-consistent open-world localization in video
Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf

TL;DR
This paper introduces Video OWL-ViT, a model that adapts open-world image models for temporally consistent object localization in videos, leveraging a transformer decoder for recurrent object tracking.
Contribution
It presents a novel architecture that extends OWL-ViT to videos, enabling end-to-end training and improved temporal consistency in open-world localization tasks.
Findings
Achieves better temporal consistency than tracking-by-detection baselines.
Successfully transfers open-world capabilities from large-scale image-text pre-training.
Demonstrates strong performance on the TAO-OW benchmark.
Abstract
We present an architecture and a training recipe that adapt pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end…
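The recurrence described in the abstract, where the decoder's output tokens for one frame become the object queries for the next, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head cross-attention layer, the random weights, and the names `decoder_step`, `track_video`, `frames_tokens`, and `init_queries` are all assumptions made for illustration, and the random arrays stand in for image-encoder tokens and learned queries.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_step(queries, frame_tokens, w_q, w_k, w_v):
    """Hypothetical stand-in for the decoder: one single-head
    cross-attention layer with a residual connection."""
    q = queries @ w_q                # (N, D) object queries
    k = frame_tokens @ w_k           # (P, D) keys from the frame's tokens
    v = frame_tokens @ w_v           # (P, D) values from the frame's tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, P)
    return queries + attn @ v        # (N, D) updated object tokens

def track_video(frames_tokens, init_queries, params):
    """Propagate object tokens through time: the output tokens for
    frame t are used as the queries for frame t + 1."""
    queries = init_queries
    per_frame = []
    for tokens in frames_tokens:
        queries = decoder_step(queries, tokens, *params)
        per_frame.append(queries)
    return np.stack(per_frame)       # (T, N, D)

# Toy sizes (assumed): T frames, P tokens per frame, N object slots, D dims.
rng = np.random.default_rng(0)
T, P, N, D = 4, 16, 3, 8
frames_tokens = rng.normal(size=(T, P, D))   # placeholder for encoder output
init_queries = rng.normal(size=(N, D))       # placeholder for initial queries
params = tuple(rng.normal(size=(D, D)) * 0.1 for _ in range(3))

outputs = track_video(frames_tokens, init_queries, params)
print(outputs.shape)  # (4, 3, 8)
```

In this sketch, slot i in frame t + 1 is computed from slot i in frame t, so each slot can keep referring to the same object over time without a separate association step between per-frame detections. That is the intuition behind the temporal consistency claim, under the simplifications stated above.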
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
