Bridging Images and Videos: A Simple Learning Framework for Large Vocabulary Video Object Detection

Sanghyun Woo; Kwanyong Park; Seoung Wug Oh; In So Kweon; Joon-Young Lee

arXiv:2212.10147·cs.CV·December 21, 2022

Open Access

TL;DR

This paper introduces a simple learning framework that effectively combines detection and tracking data to improve large vocabulary video object detection, addressing supervision sparsity and catastrophic forgetting.

Contribution

The proposed framework uses all available training data, from both LVIS and TAO, to improve detection and tracking without forgetting LVIS categories that are absent from TAO.

Findings

01 Achieved consistent improvements on TAO benchmarks.

02 Set strong baseline results for large vocabulary video detection.

03 Effectively mitigated catastrophic forgetting during training.

Abstract

Scaling object taxonomies is an important step toward robust real-world deployment of recognition systems. We have witnessed remarkable progress in images since the introduction of the LVIS benchmark. To continue this success in videos, a new video benchmark, TAO, was recently presented. Given the recent encouraging results from both the detection and tracking communities, we are interested in marrying those two advances and building a strong large vocabulary video tracker. However, supervision in LVIS and TAO is inherently sparse or even missing, posing two new challenges for training large vocabulary trackers. First, LVIS contains no tracking supervision, which leads to inconsistent learning of detection (with LVIS and TAO) and tracking (only with TAO). Second, the detection supervision in TAO is partial, which results in catastrophic forgetting of absent LVIS categories…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques