Bridging Images and Videos: A Simple Learning Framework for Large Vocabulary Video Object Detection
Sanghyun Woo, Kwanyong Park, Seoung Wug Oh, In So Kweon, Joon-Young Lee

TL;DR
This paper introduces a simple learning framework that effectively combines detection and tracking data to improve large vocabulary video object detection, addressing supervision sparsity and catastrophic forgetting.
Contribution
The proposed framework leverages all available training data from both LVIS and TAO to improve detection and tracking while retaining recognition of LVIS categories that are absent from TAO.
Findings
Achieved consistent improvements on TAO benchmarks.
Set strong baseline results for large vocabulary video detection.
Effectively mitigated catastrophic forgetting during training.
Abstract
Scaling object taxonomies is one of the important steps toward a robust real-world deployment of recognition systems. We have faced remarkable progress in images since the introduction of the LVIS benchmark. To continue this success in videos, a new video benchmark, TAO, was recently presented. Given the recent encouraging results from both detection and tracking communities, we are interested in marrying those two advances and building a strong large vocabulary video tracker. However, supervisions in LVIS and TAO are inherently sparse or even missing, posing two new challenges for training the large vocabulary trackers. First, no tracking supervisions are in LVIS, which leads to inconsistent learning of detection (with LVIS and TAO) and tracking (only with TAO). Second, the detection supervisions in TAO are partial, which results in catastrophic forgetting of absent LVIS categories…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
