Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection
Khurram Azeem Hashmi, Talha Uddin Sheikh, Didier Stricker, Muhammad Zeshan Afzal

TL;DR
This paper introduces FAIM, a video object detection method that uses instance mask features to improve temporal feature aggregation, reaching 87.9% mAP at 33 FPS on the ImageNet VID dataset.
Contribution
The paper proposes a new instance mask-based feature aggregation approach and the FAIM method, which significantly improves video object detection by refining temporal feature aggregation.
Findings
FAIM achieves 87.9% mAP on the ImageNet VID dataset.
FAIM runs at 33 FPS on a single 2080Ti GPU.
The approach is robust, method-agnostic, and effective for multi-object tracking.
Abstract
The primary challenge in Video Object Detection (VOD) is effectively exploiting temporal information to enhance object representations. Traditional strategies, such as aggregating region proposals, often suffer from feature variance due to the inclusion of background information. We introduce a novel instance mask-based feature aggregation approach, significantly refining this process and deepening the understanding of object dynamics across video frames. We present FAIM, a new VOD method that enhances temporal Feature Aggregation by leveraging Instance Mask features. In particular, we propose the lightweight Instance Feature Extraction Module (IFEM) to learn instance mask features and the Temporal Instance Classification Aggregation Module (TICAM) to aggregate instance mask and classification features across video frames. Using YOLOX as a base detector, FAIM achieves 87.9% mAP on the…
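The contrast the abstract draws between box-level and mask-level aggregation can be illustrated with a toy sketch. This is not the paper's IFEM or TICAM: the function names (`masked_average_pool`, `aggregate_over_frames`), the cosine-similarity weighting, and the `temperature` parameter are all assumptions chosen for illustration. It only shows why pooling inside an instance mask avoids mixing in background features, and how per-frame instance features might then be combined with softmax weights.

```python
import math

def masked_average_pool(feature_map, mask):
    """Average C-dim feature vectors over pixels where mask is nonzero.

    feature_map: H x W x C nested lists; mask: H x W nested lists of 0/1.
    Hypothetical helper, not taken from the paper.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for vec, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: fall back to a zero vector
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def aggregate_over_frames(key_feature, support_features, temperature=1.0):
    """Softmax-weighted sum of support-frame features.

    Weights come from cosine similarity to the key-frame feature; this
    weighting scheme is an assumption, not the paper's TICAM.
    """
    sims = [cosine(key_feature, f) / temperature for f in support_features]
    weights = softmax(sims)
    channels = len(key_feature)
    return [sum(w * f[c] for w, f in zip(weights, support_features))
            for c in range(channels)]

# Toy 2x2 region: left column is the object ([1, 0]), right is background ([0, 1]).
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [0.0, 1.0]]]
instance_mask = [[1, 0], [1, 0]]
whole_box = [[1, 1], [1, 1]]

mask_feature = masked_average_pool(fmap, instance_mask)  # object pixels only
box_feature = masked_average_pool(fmap, whole_box)       # object mixed with background
aggregated = aggregate_over_frames(mask_feature, [mask_feature, box_feature])
```

In this toy case the mask-pooled feature stays a pure object vector, while the box-pooled one is diluted by background, which is the feature-variance problem the abstract attributes to proposal-based aggregation.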
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection
Methods: Average Pooling · Softmax · 1x1 Convolution · Global Average Pooling · Residual Connection · Batch Normalization · Convolution · CSPDarknet53 · Balanced Selection
