DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

Yiming Cui

arXiv:2210.00588·cs.CV·October 4, 2022·1 cites

Open Access

TL;DR

This paper introduces a dynamic feature aggregation method for video object detection that adaptively selects the frames used for feature enhancement, improving inference speed over fixed-frame aggregation while keeping detection accuracy comparable.

Contribution

The paper proposes a vanilla dynamic aggregation module, a deformable extension of that module, and an inplace distillation loss, with the goal of speeding up feature-aggregation-based video object detectors.

Findings

01. Inference speed improved by up to 76%

02. Maintains comparable detection accuracy

03. Effective on ImageNet VID benchmark

Abstract

Video object detection is a fundamental yet challenging task in computer vision. One practical solution is to exploit temporal information from the video and apply feature aggregation to enhance the object features in each frame. Though effective, existing methods suffer from low inference speed because they use a fixed number of frames for feature aggregation regardless of the input frame. This paper therefore aims to improve the inference speed of current feature-aggregation-based video object detectors while maintaining their performance. To this end, we propose a vanilla dynamic aggregation module that adaptively selects the frames for feature enhancement. We then extend the vanilla dynamic aggregation module to a more effective and reconfigurable deformable version. Finally, we introduce an inplace distillation loss to improve the representations…
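The contrast between fixed-frame and input-dependent aggregation can be illustrated with a small sketch. This is not the paper's module: the abstract does not say how frames are selected, so the similarity threshold, the `max_frames` cap, and every function name below are hypothetical choices made only for illustration. Each frame is reduced to a single feature vector, and support frames are kept only when their cosine similarity to the current frame passes the threshold, so the number of aggregated frames varies with the input.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate(current, supports):
    """Softmax-weighted average of the current feature and its support features."""
    feats = [current] + supports
    scores = [cosine(current, f) for f in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * f[i] for w, f in zip(weights, feats))
            for i in range(len(current))]


def select_supports(current, candidates, max_frames, min_similarity):
    """Hypothetical selection rule (an assumption, not taken from the paper):
    keep the most similar candidates, at most max_frames of them, and only
    those whose similarity reaches min_similarity."""
    ranked = sorted(candidates, key=lambda f: cosine(current, f), reverse=True)
    return [f for f in ranked[:max_frames]
            if cosine(current, f) >= min_similarity]


def dynamic_aggregate(current, candidates, max_frames=4, min_similarity=0.5):
    """Aggregate with an input-dependent number of support frames.

    Returns the enhanced feature and how many support frames were used."""
    supports = select_supports(current, candidates, max_frames, min_similarity)
    return aggregate(current, supports), len(supports)


if __name__ == "__main__":
    current = [1.0, 0.0]
    candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    enhanced, used = dynamic_aggregate(current, candidates)
    print(f"support frames used: {used} of {len(candidates)}")
```

A fixed-frame baseline in this sketch would call `aggregate(current, candidates)` directly and always pay for every candidate. The dynamic variant skips candidates that fall below the threshold, which is where an inference-time saving could come from.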

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings