STF: Spatio-Temporal Fusion Module for Improving Video Object Detection
Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir

TL;DR
This paper introduces a spatio-temporal fusion (STF) module that uses complementary information from consecutive video frames to improve object detection accuracy, combining attention modules with learnable feature merging.
Contribution
The STF framework combines multi-frame and single-frame attention modules with a dual-frame fusion module to improve video object detection performance.
Findings
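Improved detection performance over baseline object detectors on three benchmarks that include video sequences of moving road users.
Attention modules let the network share feature maps between nearby frames, yielding more robust object representations.
The dual-frame fusion module merges feature maps in a learnable manner to improve them.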
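To make the idea of learnable fusion concrete, here is a minimal NumPy sketch of one common way to merge two feature maps: a sigmoid gate that blends them element by element. This is an illustrative assumption, not the paper's dual-frame fusion module; the function name `gated_fusion` and the parameters `w_a`, `w_b`, and `bias` are hypothetical, and in a real detector they would be trained with the rest of the network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(feat_a, feat_b, w_a, w_b, bias):
    """Blend two (C, H, W) feature maps with a per-channel gate.

    Hypothetical gate form; the paper's fusion module may differ.
    w_a, w_b, bias: arrays of shape (C,), standing in for learned weights.
    """
    # Broadcast per-channel parameters over the spatial dimensions.
    logits = (w_a[:, None, None] * feat_a
              + w_b[:, None, None] * feat_b
              + bias[:, None, None])
    gate = sigmoid(logits)  # values in (0, 1)
    # Convex combination: each output value lies between the two inputs.
    return gate * feat_a + (1.0 - gate) * feat_b

rng = np.random.default_rng(0)
feat_t = rng.standard_normal((4, 8, 8))     # features of the current frame
feat_prev = rng.standard_normal((4, 8, 8))  # features of a nearby frame

# With all parameters at zero the gate is 0.5 everywhere,
# so the fusion reduces to a plain average of the two maps.
zeros = np.zeros(4)
fused = gated_fusion(feat_t, feat_prev, zeros, zeros, zeros)
```

Starting from a plain average and letting training move the gate away from 0.5 is one simple design choice; it lets the network decide, per location, which frame to trust more.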
Abstract
Consecutive frames in a video contain redundancy, but they may also contain relevant complementary information for the detection task. The objective of our work is to leverage this complementary information to improve detection. Therefore, we propose a spatio-temporal fusion framework (STF). We first introduce multi-frame and single-frame attention modules that allow a neural network to share feature maps between nearby frames to obtain more robust object representations. Second, we introduce a dual-frame fusion module that merges feature maps in a learnable manner to improve them. Our evaluation is conducted on three different benchmarks including video sequences of moving road users. The performed experiments demonstrate that the proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors. Code is available at…
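The abstract does not spell out how the attention modules share feature maps between nearby frames. As a rough illustration only, the sketch below uses generic scaled dot-product attention, with queries taken from the current frame and keys and values taken from a neighboring frame. The function name `cross_frame_attention` is hypothetical, learned projection layers are omitted for brevity, and nothing here should be read as the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(feat_cur, feat_nbr):
    """Enrich current-frame features with a neighboring frame's features.

    feat_cur, feat_nbr: arrays of shape (C, H, W).
    Assumption: plain scaled dot-product attention without learned projections.
    """
    c, h, w = feat_cur.shape
    q = feat_cur.reshape(c, h * w).T  # (HW, C): one query per location
    k = feat_nbr.reshape(c, h * w).T  # (HW, C): one key per neighbor location
    v = k                             # values reuse the neighbor features
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (HW, HW), rows sum to 1
    attended = attn @ v               # (HW, C): neighbor features per query
    out = q + attended                # residual connection keeps current features
    return out.T.reshape(c, h, w), attn

rng = np.random.default_rng(0)
cur = rng.standard_normal((4, 6, 6))
nbr = rng.standard_normal((4, 6, 6))
enriched, attn = cross_frame_attention(cur, nbr)
```

Each location in the current frame receives a weighted mix of neighbor-frame features, so an object that is blurred or partly occluded in one frame can borrow evidence from a frame where it is clearer.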
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Infrared Target Detection Methodologies · Video Surveillance and Tracking Methods
