Guidance and Teaching Network for Video Salient Object Detection
Yingxia Jiao, Xiao Wang, Yu-Cheng Chou, Shouyuan Yang, Ge-Peng Ji, Rong Zhu, Ge Gao

TL;DR
The paper introduces GTNet, a novel architecture for video salient object detection that effectively captures spatial-temporal cues through implicit and explicit guidance, improving accuracy in complex scenarios.
Contribution
It proposes a guidance and teaching network that decouples spatial-temporal cues and fuses cross-modal features for enhanced video saliency detection.
Findings
Achieves competitive results on three benchmarks.
Runs at approximately 28 fps on a single GPU.
Outperforms 14 state-of-the-art methods.
Abstract
Owing to the difficulties of mining spatial-temporal cues, the existing approaches for video salient object detection (VSOD) are limited in understanding complex and noisy scenarios, and often fail in inferring prominent objects. To alleviate such shortcomings, we propose a simple yet efficient architecture, termed Guidance and Teaching Network (GTNet), to independently distil effective spatial and temporal cues with implicit guidance and explicit teaching at feature- and decision-level, respectively. To be specific, we (a) introduce a temporal modulator to implicitly bridge features from motion into the appearance branch, which is capable of fusing cross-modal features collaboratively, and (b) utilise a motion-guided mask to propagate the explicit cues during the feature aggregation. This novel learning strategy achieves satisfactory results via decoupling the complex spatial-temporal…
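The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the formulas, so the sigmoid gate with a residual path, the elementwise mask, and the function names `temporal_modulate` and `apply_motion_mask` are all assumptions chosen for illustration, applied to tiny 2x2 "feature maps" stored as nested lists.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def temporal_modulate(appearance, motion):
    """Implicit guidance (assumed form, not from the paper).

    Each appearance value is scaled by 1 + sigmoid(motion value), so strong
    motion responses amplify the appearance feature while the residual "1"
    keeps the original appearance signal intact.
    """
    return [
        [a * (1.0 + sigmoid(m)) for a, m in zip(a_row, m_row)]
        for a_row, m_row in zip(appearance, motion)
    ]


def apply_motion_mask(features, motion_mask):
    """Explicit teaching (assumed form, not from the paper).

    Aggregated features are weighted elementwise by a motion-derived mask
    whose values lie in [0, 1].
    """
    return [
        [f * w for f, w in zip(f_row, w_row)]
        for f_row, w_row in zip(features, motion_mask)
    ]


# Hypothetical 2x2 inputs, for illustration only.
appearance = [[1.0, 2.0], [3.0, 4.0]]
motion = [[0.0, 0.0], [10.0, -10.0]]
mask = [[1.0, 0.0], [1.0, 0.5]]

fused = temporal_modulate(appearance, motion)   # feature-level fusion
guided = apply_motion_mask(fused, mask)         # mask applied during aggregation
```

In this sketch, a position with zero motion response is scaled by 1.5, a strongly moving position by nearly 2, and a mask value of 0 suppresses the position entirely. A real model would learn both the gate and the mask from data with convolutional layers.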
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
