Towards High Performance Video Object Detection for Mobiles

Xizhou Zhu; Jifeng Dai; Xingchi Zhu; Yichen Wei; Lu Yuan

arXiv:1804.05830 · cs.CV · April 17, 2018 · 35 citations

TL;DR

This paper introduces a lightweight, end-to-end trainable video object detection system optimized for mobile devices, combining sparse feature propagation and multi-frame aggregation to achieve high accuracy and real-time speed.

Contribution

The paper proposes a lightweight architecture for mobiles that pairs a very small flow network (Light Flow) with a flow-guided GRU for efficient feature aggregation.

Findings

01. Achieves 60.2% mAP at 25.6 fps on mobiles (e.g., Huawei Mate 8).

02. Uses sparse feature propagation on non-key frames.

03. Trains the whole lightweight detection system end-to-end.

Abstract

Despite the recent success of video object detection on desktop GPUs, its architecture is still far too heavy for mobiles. It is also unclear whether the key principles of sparse feature propagation and multi-frame feature aggregation apply under very limited computational resources. In this paper, we present a lightweight network architecture for video object detection on mobiles. A lightweight image object detector is applied on sparse key frames. A very small network, Light Flow, is designed for establishing correspondence across frames. A flow-guided GRU module is designed to effectively aggregate features on key frames. For non-key frames, sparse feature propagation is performed. The whole network can be trained end-to-end. The proposed system achieves a 60.2% mAP score at a speed of 25.6 fps on mobiles (e.g., Huawei Mate 8).
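
The key-frame scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `extract_features` and `estimate_flow` are hypothetical callables standing in for the lightweight detector backbone and Light Flow, `key_interval` and `blend` are assumed parameters, and a fixed convex blend is swapped in for the learned flow-guided GRU. The flow is assumed to store the x displacement in channel 0 and the y displacement in channel 1, pointing from each location in the current frame to its match in the key frame.

```python
import numpy as np


def warp(features, flow):
    """Bilinearly sample `features` (H, W, C) at positions shifted by `flow` (H, W, 2)."""
    h, w, _ = features.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Assumed convention: flow[..., 0] is the x shift, flow[..., 1] the y shift.
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = (1 - wx) * features[y0, x0] + wx * features[y0, x1]
    bottom = (1 - wx) * features[y1, x0] + wx * features[y1, x1]
    return (1 - wy) * top + wy * bottom


def run_video(frames, extract_features, estimate_flow, key_interval=10, blend=0.5):
    """Return per-frame feature maps and the number of expensive extractor calls."""
    outputs = []
    key_frame = None
    key_feat = None
    heavy_calls = 0
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            # Key frame: run the (expensive) feature extractor.
            new_feat = extract_features(frame)
            heavy_calls += 1
            if key_feat is not None:
                # Multi-frame aggregation: align the previous aggregated
                # features to the new key frame, then combine. A fixed blend
                # is a simplified stand-in for the flow-guided GRU.
                flow = estimate_flow(frame, key_frame)
                new_feat = blend * warp(key_feat, flow) + (1 - blend) * new_feat
            key_frame, key_feat = frame, new_feat
            outputs.append(key_feat)
        else:
            # Non-key frame: sparse feature propagation. Only the cheap flow
            # estimate is computed; key-frame features are warped over.
            flow = estimate_flow(frame, key_frame)
            outputs.append(warp(key_feat, flow))
    return outputs, heavy_calls
```

Under these assumptions the expensive extractor runs only once every `key_interval` frames, which is where the speed-up comes from; the detection head that would consume each returned feature map is omitted.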

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Gated Recurrent Unit