Identity-Consistent Aggregation for Video Object Detection

Chaorui Deng; Da Chen; Qi Wu

arXiv:2308.07737·cs.CV·August 16, 2023·1 cites

TL;DR

This paper introduces ClipVID, a video object detection model whose Identity-Consistent Aggregation layers enhance each object's representation using temporal context from the same object identity across frames. It reports 84.7% mAP on ImageNet VID at 39.3 fps, about 7 times faster than previous methods.

Contribution

The paper proposes an efficient VID model with identity-consistent aggregation that improves object representation and detection accuracy while enabling parallel clip-wise predictions.

Findings

01. Achieves 84.7% mAP on the ImageNet VID dataset.

02. Runs at 39.3 fps, 7 times faster than previous methods.

03. Outperforms the state of the art in both accuracy and speed.

Abstract

In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame. Existing methods treat the temporal contexts obtained from different objects indiscriminately and ignore their different identities. Intuitively, however, aggregating local views of the same object in different frames may facilitate a better understanding of the object. Thus, in this paper, we aim to enable the model to focus on the identity-consistent temporal contexts of each object to obtain more comprehensive object representations and handle rapid object appearance variations such as occlusion and motion blur. However, realizing this goal on top of existing VID models is inefficient because of their redundant region proposals and non-parallel, frame-wise prediction. To address this, we propose ClipVID, a VID…
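As a rough illustration of the idea in the abstract (aggregating views of the same object across frames while ignoring other objects), the sketch below applies softmax attention restricted to detections that share an identity label. It is not the ClipVID implementation: the function name `aggregate_by_identity`, the plain-list feature format, and the assumption that identity labels are already known are all hypothetical choices made for this example.

```python
import math


def aggregate_by_identity(features, identities):
    """Attend only over detections that share the query's identity.

    features:   list of equal-length feature vectors, one per detection
                (detections may come from any frame of the clip).
    identities: list of identity labels, one per detection. Assumed to be
                given here; a real model would have to establish them.
    Returns one aggregated feature vector per detection.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    aggregated = []
    for query, query_id in zip(features, identities):
        # Keep only detections with the same identity (includes the query itself).
        same = [j for j, other_id in enumerate(identities) if other_id == query_id]
        # Scaled dot-product scores against the identity-consistent contexts.
        scores = [
            sum(a * b for a, b in zip(query, features[j])) / scale for j in same
        ]
        # Numerically stable softmax weights.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the identity-consistent features.
        aggregated.append(
            [
                sum(w * features[j][d] for w, j in zip(weights, same)) / total
                for d in range(dim)
            ]
        )
    return aggregated


# Two views of a "car" (one slightly degraded) and one view of a "dog".
feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
ids = ["car", "car", "dog"]
for label, vec in zip(ids, aggregate_by_identity(feats, ids)):
    print(label, vec)
```

In this toy setting the two "car" views are blended with each other, while the "dog" feature is left unchanged because no other detection shares its identity. An identity-agnostic aggregation would instead mix all three.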

Code & Models

Repositories

bladewaltz1/clipvid (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Advanced Neural Network Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Independent Component Analysis · Focus