VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking
Zekun Qian, Ruize Han, Junhui Hou, Linqi Song, Wei Feng

TL;DR
VOVTrack introduces a video-centric, self-supervised approach for open-vocabulary multi-object tracking, effectively localizing and associating diverse objects in videos without relying on extensive annotations.
Contribution
The paper presents VOVTrack, a novel method integrating object states and prompt-guided attention with self-supervised learning for open-vocabulary object tracking in videos.
Findings
VOVTrack outperforms existing methods on open-vocabulary tracking benchmarks, achieving state-of-the-art results.
Its self-supervised training leverages raw video data without annotations.
Abstract
Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification…
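The abstract frames OVMOT as open-vocabulary classification combined with association of objects across frames. The following is a minimal, generic sketch of those two steps and is not VOVTrack's actual method: it assumes each detected region already has an embedding in the same space as the text embeddings of the class names, and every function name, the `min_sim` threshold, and the toy vectors are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def classify_open_vocab(region_emb, text_emb, class_names):
    """Label each region with the class whose text embedding is most similar.

    Novel classes are handled by adding their names (and text embeddings)
    to the vocabulary; nothing here is retrained.
    """
    sims = l2_normalize(region_emb) @ l2_normalize(text_emb).T
    return [class_names[i] for i in sims.argmax(axis=1)]


def associate(prev_emb, curr_emb, min_sim=0.5):
    """Greedy one-to-one matching of current detections to previous tracks.

    Returns a dict mapping current-frame index -> previous-frame index.
    Unmatched current detections would start new tracks.
    """
    sims = l2_normalize(prev_emb) @ l2_normalize(curr_emb).T
    matches, used_prev, used_curr = {}, set(), set()
    # Visit candidate pairs from most to least similar.
    for flat in np.argsort(-sims, axis=None):
        p, c = divmod(int(flat), sims.shape[1])
        if sims[p, c] < min_sim:
            break  # all remaining pairs are even less similar
        if p in used_prev or c in used_curr:
            continue
        matches[c] = p
        used_prev.add(p)
        used_curr.add(c)
    return matches


# Toy 2-D embeddings (hypothetical values, not real model outputs).
class_names = ["cat", "drone"]
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
prev_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
curr_emb = np.array([[0.1, 0.9], [0.95, 0.05], [-1.0, -1.0]])

labels = classify_open_vocab(curr_emb[:2], text_emb, class_names)
matches = associate(prev_emb, curr_emb)
```

In this toy setup the third current detection resembles no previous track, so it stays unmatched. A real tracker would typically replace the greedy loop with an optimal assignment and combine appearance similarity with motion cues.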
Taxonomy
Topics: Speech and dialogue systems
Methods: Softmax · Attention Is All You Need
