Identity-Consistent Aggregation for Video Object Detection
Chaorui Deng, Da Chen, Qi Wu

TL;DR
This paper introduces ClipVID with Identity-Consistent Aggregation layers that enhance video object detection by focusing on consistent object identities across frames, achieving state-of-the-art accuracy and significantly faster processing speeds.
Contribution
The paper proposes an efficient VID model with identity-consistent aggregation that improves object representation and detection accuracy while enabling parallel clip-wise predictions.
Findings
Achieves 84.7% mAP on the ImageNet VID dataset.
Runs at 39.3 fps, about 7 times faster than previous state-of-the-art methods.
Outperforms previous state-of-the-art methods in both accuracy and speed.
Abstract
In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame. Existing methods treat the temporal contexts obtained from different objects indiscriminately and ignore their different identities. Intuitively, however, aggregating local views of the same object in different frames may facilitate a better understanding of the object. Thus, in this paper, we aim to enable the model to focus on the identity-consistent temporal contexts of each object to obtain more comprehensive object representations and handle rapid object appearance variations such as occlusion and motion blur. However, realizing this goal on top of existing VID models faces low-efficiency problems due to their redundant region proposals and nonparallel frame-wise prediction manner. To address this, we propose ClipVID, a VID…
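The idea of aggregating only identity-consistent temporal contexts can be illustrated with a minimal sketch. This is not ClipVID's actual aggregation layer: the function name `identity_consistent_aggregate` is hypothetical, the identity labels are assumed to be given rather than predicted, and plain dot-product attention stands in for whatever the model learns. Each object feature is updated by attending only to features from the clip that share its identity.

```python
import math

def identity_consistent_aggregate(feats, ids):
    """Illustrative only (hypothetical helper, not the paper's layer).

    feats: list of feature vectors, one per detected object, pooled
           over all frames of a clip.
    ids:   identity label for each feature (assumed to be known here).
    Returns one aggregated vector per input feature, computed as a
    softmax-weighted average over features with the same identity.
    """
    dim = len(feats[0])
    out = []
    for i, query in enumerate(feats):
        # Restrict the temporal context to views of the same object.
        peers = [j for j, ident in enumerate(ids) if ident == ids[i]]
        # Scaled dot-product similarity between the query and each peer.
        scores = [
            sum(q * k for q, k in zip(query, feats[j])) / math.sqrt(dim)
            for j in peers
        ]
        # Numerically stable softmax over the peers.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * feats[j][c] for w, j in zip(weights, peers)) / total
            for c in range(dim)
        ])
    return out

# Two views of object 0 and a single view of object 1.
feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
ids = [0, 0, 1]
aggregated = identity_consistent_aggregate(feats, ids)
```

In this sketch the features of object 1 have no influence on the aggregated features of object 0, in contrast to an aggregation that treats all temporal contexts indiscriminately.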
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Advanced Neural Network Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Independent Component Analysis · Focus
