TL;DR
This paper introduces ViDSOD-100, a new RGB-D video dataset with high-quality annotations, and proposes ATF-Net, a novel model that effectively integrates appearance, motion, and depth information for improved salient object detection in videos.
Contribution
The paper presents a new annotated RGB-D video dataset and a baseline model that fuses multiple modalities for enhanced video saliency detection.
Findings
ATF-Net outperforms existing methods on the ViDSOD-100 and DAVSOD datasets.
The multi-modality fusion approach improves detection accuracy.
Experimental results demonstrate significant performance gains over state-of-the-art techniques.
Abstract
With the rapid development of depth sensors, more and more RGB-D videos can be obtained. Identifying the foreground in RGB-D videos is a fundamental and important task. However, existing salient object detection (SOD) works focus only on either static RGB-D images or RGB videos, ignoring the collaboration of RGB-D and video information. In this paper, we first collect a new annotated RGB-D video SOD (ViDSOD-100) dataset, which contains 100 videos with a total of 9,362 frames, acquired from diverse natural scenes. All the frames in each video are manually annotated with high-quality saliency annotations. Moreover, we propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection. Our method aggregates the appearance information from an input RGB image, spatio-temporal information from an estimated motion map, and the geometry…
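To make the idea of fusing three streams concrete, here is a minimal NumPy sketch of one generic way to combine appearance (RGB), motion, and depth feature maps with per-pixel attention weights. It is not the authors' ATF-Net: the function name `attentive_triple_fusion`, the scoring vectors `w`, and the softmax-weighted sum are all assumptions chosen for illustration, and the feature maps are random placeholders rather than outputs of real encoders.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attentive_triple_fusion(f_rgb, f_motion, f_depth, w):
    """Fuse three (C, H, W) feature maps with per-pixel attention.

    w is a hypothetical (3, C) array of scoring vectors, one per modality.
    Returns the fused (C, H, W) map and the (3, H, W) attention weights.
    """
    feats = np.stack([f_rgb, f_motion, f_depth])      # (3, C, H, W)
    # Score each modality at each pixel: channel vector dotted with w[m].
    scores = np.einsum("mchw,mc->mhw", feats, w)      # (3, H, W)
    # Normalise across the three modalities so weights sum to 1 per pixel.
    attn = softmax(scores, axis=0)                    # (3, H, W)
    # Weighted sum of the three feature maps.
    fused = (attn[:, None] * feats).sum(axis=0)       # (C, H, W)
    return fused, attn


# Placeholder features standing in for encoder outputs (assumed shapes).
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
f_rgb, f_motion, f_depth = (rng.standard_normal((C, H, W)) for _ in range(3))
w = rng.standard_normal((3, C))
fused, attn = attentive_triple_fusion(f_rgb, f_motion, f_depth, w)
```

In this sketch each pixel's fused feature is a convex combination of the three modality features, so a stream that scores poorly at a pixel contributes less there. A learned network would replace the fixed scoring vectors with trainable layers.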
