TL;DR
This paper introduces an unsupervised adversarial domain adaptation method that transforms static images to resemble video frames, improving the transferability of image-trained object detectors to videos without requiring manual annotations.
Contribution
It proposes a novel pixel-level adversarial image translation technique for unsupervised domain adaptation from images to videos, enhancing video object detection performance.
Findings
Boosts generalization of image detectors on videos
Achieves results competitive with weakly supervised methods
Demonstrates effectiveness on the YouTube-Objects datasets
Abstract
Deep-learning-based object detectors require thousands of diverse examples annotated with bounding boxes and class labels. Though image object detectors have progressed rapidly in recent years with the release of multiple large-scale static image datasets, object detection on videos remains an open problem due to the scarcity of annotated video frames. A robust video object detector is an essential component for video understanding and for curating large-scale automated annotations in videos. The domain difference between images and videos makes the transferability of image object detectors to videos sub-optimal. The most common solution is to use weakly supervised annotations, where each video frame has to be tagged for the presence or absence of object categories. This still requires manual effort. In this paper we take a step forward by adapting the concept of unsupervised adversarial image-to-image…
