Towards High Performance Video Object Detection for Mobiles
Xizhou Zhu, Jifeng Dai, Xingchi Zhu, Yichen Wei, Lu Yuan

TL;DR
This paper introduces a lightweight, end-to-end trainable video object detection system optimized for mobile devices, combining sparse feature propagation and multi-frame aggregation to achieve high accuracy and real-time speed.
Contribution
It proposes a novel lightweight architecture with a small flow network and flow-guided GRU for efficient feature aggregation on mobiles.
Findings
Achieves 60.2% mAP at 25.6 fps on mobile devices.
Effective sparse feature propagation for non-key frames.
End-to-end training of the lightweight detection system.
Abstract
Despite the recent success of video object detection on Desktop GPUs, its architecture is still far too heavy for mobiles. It is also unclear whether the key principles of sparse feature propagation and multi-frame feature aggregation apply at very limited computational resources. In this paper, we present a light weight network architecture for video object detection on mobiles. Light weight image object detector is applied on sparse key frames. A very small network, Light Flow, is designed for establishing correspondence across frames. A flow-guided GRU module is designed to effectively aggregate features on key frames. For non-key frames, sparse feature propagation is performed. The whole network can be trained end-to-end. The proposed system achieves 60.2% mAP score at speed of 25.6 fps on mobiles (e.g., HuaWei Mate 8).
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Gated Recurrent Unit
