MaskVD: Region Masking for Efficient Video Object Detection
Sreetama Sarkar, Gourav Datta, Souvik Kundu, Kai Zheng, Chirayata Bhattacharyya, Peter A. Beerel

TL;DR
MaskVD introduces a region masking strategy for video object detection that leverages semantic and temporal information to significantly reduce computation and latency while maintaining detection accuracy.
Contribution
This work proposes a novel region masking method for ViTs in video detection, reducing FLOPs and latency with minimal performance loss, outperforming state-of-the-art methods.
Findings
Reduces input regions by up to 80% in ViT-based detection.
Reduces FLOPs by 3.14x and latency by 1.5x.
Achieves 2.3x memory and 1.14x latency improvements over SOTA.
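To make the region-reduction figure concrete, here is a minimal Python sketch of how a per-patch keep mask could shrink the token sequence fed to a ViT. It is an illustration under stated assumptions, not the paper's implementation: the function names `keep_unmasked_patches` and `scatter_back` are hypothetical, the mask is supplied by hand rather than derived from semantic or temporal cues, and the sketch only counts tokens, so it makes no claim about the resulting FLOPs or latency.

```python
def keep_unmasked_patches(patches, keep_mask):
    """Return the patches whose mask entry is True, plus their original indices.

    `patches` is any sequence of per-patch items (for example, embeddings);
    `keep_mask` is a same-length sequence of booleans (hypothetical input).
    """
    if len(patches) != len(keep_mask):
        raise ValueError("patches and keep_mask must have the same length")
    kept_indices = [i for i, keep in enumerate(keep_mask) if keep]
    kept = [patches[i] for i in kept_indices]
    return kept, kept_indices


def scatter_back(kept_outputs, kept_indices, total, fill=None):
    """Place per-patch outputs back at their original positions.

    Positions that were masked out receive `fill`.
    """
    out = [fill] * total
    for idx, value in zip(kept_indices, kept_outputs):
        out[idx] = value
    return out


# Toy frame split into 10 patches; 8 of them (80%) are masked out.
patches = list(range(10))
keep_mask = [True, False, False, False, False,
             True, False, False, False, False]

kept, kept_indices = keep_unmasked_patches(patches, keep_mask)
masked_fraction = 1 - len(kept) / len(patches)
print(len(kept), kept_indices, round(masked_fraction, 2))
```

In this toy case only two of ten patches would reach the backbone. How a real system chooses the mask, and what it does at the masked positions afterwards, is exactly what the paper addresses and is not modeled here.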
Abstract
Video tasks are compute-heavy and thus pose a challenge when deploying in real-time applications, particularly for tasks that require state-of-the-art Vision Transformers (ViTs). Several research efforts have tried to address this challenge by leveraging the fact that large portions of the video undergo very little change across frames, leading to redundant computations in frame-based video processing. In particular, some works leverage pixel or semantic differences across frames; however, this yields limited latency benefits with significantly increased memory overhead. This paper, in contrast, presents a strategy for masking regions in video frames that leverages the semantic information in images and the temporal correlation between frames to significantly reduce FLOPs and latency with little to no penalty in performance over baseline models. In particular, we demonstrate that by…
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Advanced Neural Network Applications · Face recognition and analysis
