Unifying Tracking and Image-Video Object Detection
Peirong Liu, Rui Wang, Pengchuan Zhang, Omid Poursaeed, Yipin Zhou, Xuefei Cao, Sreya Dutta Roy, Ashish Shah, Ser-Nam Lim

TL;DR
This paper introduces TrIVD, a unified end-to-end framework that combines object detection and multi-object tracking across images and videos, enabling zero-shot tracking and leveraging multi-task training.
Contribution
The authors present TrIVD as the first framework to unify image detection, video detection, and tracking in a single model, allowing cross-dataset training and zero-shot tracking.
Findings
TrIVD outperforms single-task baselines in detection and tracking.
The model achieves zero-shot tracking by leveraging detection data.
Multi-task training improves overall performance across tasks.
Abstract
Object detection (OD) has been one of the most fundamental tasks in computer vision. Recent developments in deep learning have pushed the performance of image OD to new heights through learning-based, data-driven approaches. Video OD, on the other hand, remains less explored, mostly because annotating video data is much more expensive. At the same time, multi-object tracking (MOT), which requires reasoning about track identities and spatio-temporal trajectories, is similar in spirit to video OD. However, most MOT datasets are class-specific (e.g., person-annotated only), which constrains a model's flexibility to perform tracking on other objects. We propose TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT within one end-to-end model. To handle the discrepancies and semantic overlaps of category labels across datasets, TrIVD formulates…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
