Dual Semantic Fusion Network for Video Object Detection

Lijian Lin; Haosheng Chen; Honglun Zhang; Jun Liang; Yu Li; Ying Shan; Hanzi Wang

arXiv:2009.07498 · cs.CV · September 17, 2020

TL;DR

This paper introduces DSFNet, a video object detection model that fuses frame-level and instance-level semantic information in a single framework without external guidance such as optical flow or feature memory. This improves robustness to degraded video frames and achieves state-of-the-art accuracy on the ImageNet VID dataset.

Contribution

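The paper proposes a dual semantic fusion network (DSFNet) that combines frame-level and instance-level semantics in a unified framework without external guidance. It also introduces a geometric similarity measure into the fusion process to alleviate the influence of information distortion, which makes detection more robust.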
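One way to read "frame-level fusion without external guidance" is that the fusion weights come from the features themselves, not from optical flow or a feature memory. The sketch below illustrates that generic idea and is not DSFNet's actual module. It assumes per-pixel cosine similarity between the key frame's feature map and each support frame's feature map, normalizes those similarities with a softmax across frames, and uses them to weight the aggregation. The function name `fuse_frame_level` and the `(C, H, W)` / `(T, C, H, W)` shapes are hypothetical.

```python
import numpy as np


def fuse_frame_level(key, supports, eps=1e-8):
    """Hypothetical frame-level fusion (a generic sketch, not DSFNet's exact module).

    key:      (C, H, W) feature map of the key frame
    supports: (T, C, H, W) feature maps of T nearby frames (may include the key frame)
    returns:  (C, H, W) fused feature map for the key frame
    """
    # L2-normalise along channels so that dot products become cosine similarities.
    k = key / (np.linalg.norm(key, axis=0, keepdims=True) + eps)
    s = supports / (np.linalg.norm(supports, axis=1, keepdims=True) + eps)
    # Per-pixel cosine similarity between the key frame and every support frame: (T, H, W).
    sim = np.einsum("chw,tchw->thw", k, s)
    # Softmax over the T frames at each spatial position (max-subtracted for stability).
    w = np.exp(sim - sim.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    # Similarity-weighted sum of the original (unnormalised) support features.
    return np.einsum("thw,tchw->chw", w, supports)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    key = rng.standard_normal((256, 8, 8))
    supports = rng.standard_normal((4, 256, 8, 8))
    print(fuse_frame_level(key, supports).shape)  # (256, 8, 8)
```

The sketch compares features only at the same spatial position, so it implicitly assumes small motion between frames. A fuller design would also have to handle objects that move between frames.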

Findings

01. Achieves 84.1% mAP with a ResNet-101 backbone on ImageNet VID
02. Achieves 85.4% mAP with a ResNeXt-101 backbone on ImageNet VID
03. Reaches these results without any post-processing steps, outperforming existing state-of-the-art video object detectors

Abstract

Video object detection is a tough task due to the deteriorated quality of video sequences captured under complex environments. Currently, this area is dominated by a series of feature-enhancement-based methods, which distill beneficial semantic information from multiple frames and generate enhanced features through fusing the distilled information. However, the distillation and fusion operations are usually performed at either frame level or instance level with external guidance using additional information, such as optical flow and feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the influence of information distortion caused by…
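The truncated abstract names a geometric similarity measure for the fusion process but does not define it. The sketch below assumes, purely for illustration, that the measure is based on box overlap (IoU). It combines appearance similarity (cosine similarity between proposal features) with IoU between proposal boxes, then softmax-weights the aggregation of reference proposals for each query proposal. `iou_matrix`, `fuse_instance_level`, and the way the two terms are combined are hypothetical choices, not the paper's formulation.

```python
import numpy as np


def iou_matrix(a, b):
    """Pairwise IoU between boxes a (N, 4) and b (M, 4), format (x1, y1, x2, y2), positive area."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)


def fuse_instance_level(q_feat, q_box, r_feat, r_box, eps=1e-8):
    """Hypothetical instance-level fusion with an assumed IoU-based geometric term.

    q_feat: (N, D) query proposal features, q_box: (N, 4) query boxes
    r_feat: (M, D) reference proposal features, r_box: (M, 4) reference boxes
    returns: (N, D) aggregated reference features for each query proposal
    """
    # Appearance similarity: cosine similarity between proposal features, (N, M).
    qn = q_feat / (np.linalg.norm(q_feat, axis=1, keepdims=True) + eps)
    rn = r_feat / (np.linalg.norm(r_feat, axis=1, keepdims=True) + eps)
    appearance = qn @ rn.T
    # Geometric similarity (assumption): box overlap in [0, 1], (N, M).
    geometric = iou_matrix(q_box, r_box)
    # Weights proportional to exp(appearance) * (IoU + eps), normalised over references.
    logits = appearance + np.log(geometric + eps)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ r_feat


if __name__ == "__main__":
    q_feat = np.array([[1.0, 1.0]])
    q_box = np.array([[0.0, 0.0, 2.0, 2.0]])
    r_feat = np.array([[1.0, 0.0], [0.0, 1.0]])
    r_box = np.array([[0.0, 0.0, 2.0, 2.0], [10.0, 10.0, 12.0, 12.0]])
    # Equal appearance similarity, but only the first reference overlaps the query box.
    print(fuse_instance_level(q_feat, q_box, r_feat, r_box).round(3))  # [[1. 0.]]
```

Adding the log-IoU to the appearance logits is the same as multiplying the softmax weights by the IoU. As a result, proposals that look alike but sit in unrelated locations are suppressed. The small `eps` keeps the weights defined when no reference box overlaps the query.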
