Dual Semantic Fusion Network for Video Object Detection
Lijian Lin, Haosheng Chen, Honglun Zhang, Jun Liang, Yu Li, Ying Shan, Hanzi Wang

TL;DR
This paper introduces DSFNet, a novel video object detection model that fuses semantic information at multiple levels without external guidance, improving robustness and achieving state-of-the-art accuracy on the ImageNet VID dataset.
Contribution
The paper proposes a dual semantic fusion network that combines frame-level and instance-level semantics in a unified framework without external guidance, enhancing detection robustness.
Findings
Achieves 84.1% mAP with ResNet-101 on ImageNet VID
Achieves 85.4% mAP with ResNeXt-101 on ImageNet VID
Outperforms existing methods without post-processing steps
Abstract
Video object detection is a tough task due to the deteriorated quality of video sequences captured under complex environments. Currently, this area is dominated by a series of feature enhancement based methods, which distill beneficial semantic information from multiple frames and generate enhanced features through fusing the distilled information. However, the distillation and fusion operations are usually performed at either frame level or instance level with external guidance using additional information, such as optical flow and feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the influence of information distortion caused by…
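The abstract describes fusing semantic information from multiple frames into an enhanced feature, but the excerpt here does not specify the fusion operator. The sketch below is therefore only a generic illustration of similarity-weighted feature aggregation, not DSFNet itself. The function name `fuse_features`, the use of cosine similarity, the softmax weighting, and the `temperature` parameter are all assumptions made for illustration, and the geometric similarity measure mentioned in the abstract is not modeled.

```python
import numpy as np


def fuse_features(target, supports, temperature=1.0):
    """Aggregate support features, weighted by cosine similarity to the target.

    Hypothetical helper for illustration only (not the authors' method).
    target:   (d,) feature vector of the current frame or proposal.
    supports: (n, d) feature vectors from other frames or proposals.
    Returns the fused (d,) feature and the (n,) fusion weights.
    """
    target = np.asarray(target, dtype=float)
    supports = np.asarray(supports, dtype=float)
    eps = 1e-8  # guards against division by zero for all-zero vectors

    # Cosine similarity between the target and every support feature.
    t_unit = target / (np.linalg.norm(target) + eps)
    s_unit = supports / (np.linalg.norm(supports, axis=1, keepdims=True) + eps)
    sims = s_unit @ t_unit  # shape (n,)

    # Softmax over similarities (max subtracted for numerical stability).
    logits = sims / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()

    # Weighted sum of the original (unnormalized) support features.
    fused = weights @ supports
    return fused, weights


if __name__ == "__main__":
    current = [1.0, 0.0]
    neighbours = [[1.0, 0.0], [0.0, 1.0]]
    fused, weights = fuse_features(current, neighbours)
    print("weights:", weights)
    print("fused:", fused)
```

Under these assumptions the same routine could be applied twice, once to frame-level features and once to per-proposal (instance-level) features, which is one loose way to picture the "dual" fusion the abstract refers to.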
