Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models
Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas Guibas, Katerina Fragkiadaki

TL;DR
This paper introduces a novel approach that repurposes large pre-trained static image models for open-vocabulary video tracking, enabling detection, segmentation, and re-identification of objects across frames without explicit training for tracking.
Contribution
The authors adapt static image models for open-vocabulary video tracking, combining detection, segmentation, and optical flow to track objects of any category in videos, outperforming prior methods.
Findings
Achieves strong performance on UVO and BURST benchmarks.
Outperforms previous state-of-the-art in open-world object tracking.
Can produce reasonable tracks in manipulation videos.
Abstract
Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tractor with modern large pre-trained models for static image detection and segmentation: we detect open-vocabulary object instances…
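As a rough illustration of how a dense optical flow field could carry a detected box from one frame to the next, the sketch below shifts a box by the median flow vector inside it. This is only a minimal example under stated assumptions, not the authors' implementation: the function name `propagate_box`, the median-shift rule, and the flow layout (`flow[y][x] = (dx, dy)` in pixels) are all hypothetical choices made for illustration.

```python
from statistics import median


def propagate_box(box, flow):
    """Shift an (x1, y1, x2, y2) box by the median flow inside it.

    Assumption (hypothetical layout): flow is an H x W grid where
    flow[y][x] = (dx, dy) is the pixel displacement to the next frame.
    """
    h, w = len(flow), len(flow[0])
    x1, y1, x2, y2 = box

    # Clip the box to the frame and convert to integer pixel ranges.
    xa, xb = max(0, int(x1)), min(w, int(x2) + 1)
    ya, yb = max(0, int(y1)), min(h, int(y2) + 1)
    if xb <= xa or yb <= ya:
        return box  # box lies outside the frame; leave it unchanged

    # Collect the flow vectors under the box.
    dxs = [flow[y][x][0] for y in range(ya, yb) for x in range(xa, xb)]
    dys = [flow[y][x][1] for y in range(ya, yb) for x in range(xa, xb)]

    # The median is less sensitive than the mean to background pixels
    # that fall inside the box but move differently from the object.
    dx, dy = median(dxs), median(dys)
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# Usage: a uniform flow of 2 px right and 1 px up moves the whole box.
flow = [[(2.0, -1.0)] * 10 for _ in range(10)]
moved = propagate_box((2, 3, 5, 6), flow)
```

In a tracking-by-detection loop, a box moved this way would typically serve only as an initial guess that a detector or segmenter then refines on the new frame.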
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
