Unifying Tracking and Image-Video Object Detection

Peirong Liu; Rui Wang; Pengchuan Zhang; Omid Poursaeed; Yipin Zhou; Xuefei Cao; Sreya Dutta Roy; Ashish Shah; Ser-Nam Lim

arXiv:2211.11077 · cs.CV · November 21, 2023

Open Access

TL;DR

This paper introduces TrIVD, a unified end-to-end framework that combines object detection and multi-object tracking across images and videos, enabling zero-shot tracking and leveraging multi-task training.

Contribution

TrIVD is the first framework to unify image detection, video detection, and tracking in a single model, allowing cross-dataset training and zero-shot tracking.

Findings

01. TrIVD outperforms single-task baselines in detection and tracking.
02. The model achieves zero-shot tracking by leveraging detection data.
03. Multi-task training improves overall performance across tasks.

Abstract

Object detection (OD) has been one of the most fundamental tasks in computer vision. Recent developments in deep learning have pushed the performance of image OD to new heights through learning-based, data-driven approaches. Video OD, on the other hand, remains less explored, mostly because its data annotation is much more expensive. At the same time, multi-object tracking (MOT), which requires reasoning about track identities and spatio-temporal trajectories, is similar in spirit to video OD. However, most MOT datasets are class-specific (e.g., person-annotated only), which constrains a model's flexibility to track other objects. We propose TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT within one end-to-end model. To handle the discrepancies and semantic overlaps of category labels across datasets, TrIVD formulates…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications