TL;DR
This paper introduces a real-time video object detection method that propagates key-frame features to non-key frames using short-term aggregation driven by motion cues from compressed video, reaching 77.2% mAP on ImageNet VID at 30 FPS on a Titan X GPU.
Contribution
It proposes a novel short-term feature aggregation technique leveraging motion cues in compressed videos to enhance non-key frame features efficiently.
Findings
Achieves 77.2% mAP on the ImageNet VID benchmark.
Runs at 30 FPS on a Titan X GPU.
Outperforms many existing methods in speed and accuracy.
Abstract
Video object detection is a fundamental problem in computer vision and has a wide spectrum of applications. Based on deep networks, video object detection is actively studied for pushing the limits of detection speed and accuracy. To reduce the computation cost, we sparsely sample key frames in a video and treat the remaining frames as non-key frames; a large and deep network is used to extract features for key frames and a tiny network is used for non-key frames. To enhance the features of non-key frames, we propose a novel short-term feature aggregation method to propagate the rich information in key frame features to non-key frame features in a fast way. The fast feature aggregation is enabled by the freely available motion cues in compressed videos. Further, key frame features are also aggregated based on optical flow. The propagated deep features are then integrated with the directly…
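The key-frame / non-key-frame split described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the fixed key-frame interval, the nearest-neighbor warp, the scalar blend weight `alpha`, and all function names (`is_key_frame`, `warp`, `fuse`) are assumptions made for illustration. Feature maps are plain 2D lists, and the motion field stands in for block motion vectors read from a compressed stream.

```python
from typing import List, Tuple

Grid = List[List[float]]
# Per-cell (dy, dx) offsets in feature-map cells; a stand-in for codec motion vectors.
MotionField = List[List[Tuple[int, int]]]


def is_key_frame(index: int, interval: int) -> bool:
    """Assumed schedule: every `interval`-th frame is a key frame."""
    return index % interval == 0


def warp(key_feat: Grid, motion: MotionField) -> Grid:
    """Backward warp: output cell (y, x) copies key_feat[y + dy][x + dx].

    Source coordinates are clamped to the map borders (nearest-neighbor lookup).
    """
    h, w = len(key_feat), len(key_feat[0])
    out: Grid = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = motion[y][x]
            sy = min(max(y + dy, 0), h - 1)
            sx = min(max(x + dx, 0), w - 1)
            row.append(key_feat[sy][sx])
        out.append(row)
    return out


def fuse(warped: Grid, light: Grid, alpha: float = 0.5) -> Grid:
    """Hypothetical blend of propagated key-frame features with the
    features a small network computed directly on the non-key frame."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_w, row_l)]
        for row_w, row_l in zip(warped, light)
    ]


if __name__ == "__main__":
    key_feat = [[1.0, 2.0], [3.0, 4.0]]            # features from the large network
    light_feat = [[0.0, 0.0], [0.0, 0.0]]          # features from the tiny network
    motion = [[(0, 1), (0, 1)], [(0, 1), (0, 1)]]  # every cell looks one column right

    for i in range(4):
        kind = "key" if is_key_frame(i, interval=3) else "non-key"
        print(i, kind)

    print(fuse(warp(key_feat, motion), light_feat))
```

The point of the design is cost: the warp is a memory lookup per cell, so a non-key frame needs only the tiny network plus the warp and blend, while the large network runs once per interval.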
