TL;DR
This paper introduces a new RGB-D salient object detection dataset, conducts a comprehensive benchmark of existing models, and proposes a novel deep learning architecture, D3Net, that outperforms previous methods and enables real-time applications.
Contribution
The paper provides a new high-quality dataset, a large-scale benchmark, and a novel deep architecture for RGB-D salient object detection, advancing the field significantly.
Findings
D3Net outperforms previous models across all metrics.
The new SIP dataset covers diverse real-world scenes.
D3Net achieves real-time processing at 65fps.
Abstract
The use of RGB-D information for salient object detection has been extensively explored in recent years. However, relatively few efforts have been put towards modeling salient object detection in real-world human activity scenes with RGBD. In this work, we fill the gap by making the following contributions to RGB-D salient object detection. (1) We carefully collect a new SIP (salient person) dataset, which consists of ~1K high-resolution images that cover diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds. (2) We conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research. We systematically summarize 32 popular models and evaluate 18 parts of 32 models on seven datasets containing a total of about 97K images.…
| No. | Dataset | Year | Pub. | DS. | #Obj. | Types. | Sensor. | DQ. | AQ. | GI. | CB. | Resolution (H×W) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | STERE [63] | 2012 | CVPR | 1K | one | internet | Stereo camera+sift flow [54] | | High | No | High | [251-1200]×[222-900] |
| 2 | GIT [36] | 2013 | BMVC | 0.08K | multiple | home environment | Microsoft Kinect [52] | | High | No | Low | 640×480 |
| 3 | LFSD [64] | 2014 | CVPR | 0.1K | one | 60 indoor/40 outdoor | Lytro Illum camera [53] | | High | No | High | 360×360 |
| 4 | DES [38] | 2014 | ICIMCS | 0.135K | one | 135 indoor | Microsoft Kinect [52] | | High | No | High | 640×480 |
| 5 | NLPR [39] | 2014 | ECCV | 1K | multiple | indoor/outdoor | Microsoft Kinect [52] | | High | No | High | 640×480, 480×640 |
| 6 | NJU2K [37] | 2014 | ICIP | 1.985K | one | 3D movie/internet/photo | FujiW3 camera+optical flow [65] | | High | No | High | [231-1213]×[274-828] |
| 7 | SSD [66] | 2017 | ICCVW | 0.08K | multiple | three stereo movies | Sun’s optical flow [65] | | | No | Low | 960×1080 |
| 8 | SIP (Ours) | 2020 | TNNLS | 0.929K | multiple | person in the wild | Huawei Mate10 | High | High | Yes | Low | 992×744 |
| No. | Model | Year | Pub. | Train/Val Set. (#) | Test (#) | Basic | Type | SP. | E-measure [59] |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LS [36] | 2013 | BMVC | Without training dataset | One | Markov Random Field | T | ✓ | Not available |
| 2 | RC [73] | 2013 | BMVC | Without training dataset | One | Region Contrast, SVM [74] | T | | Not available |
| 3 | LHM [39] | 2014 | ECCV | Without training dataset | One | Multi-Context Contrast | T | ✓ | 0.653-0.771 |
| 4 | DESM [38] | 2014 | ICIMCS | Without training dataset | One | Color/Depth Contrast, Spatial Bias Prior | T | | 0.770-0.868 |
| 5 | ACSD [37] | 2014 | ICIP | Without training dataset | One | Difference of Gaussian | T | ✓ | 0.780-0.850 |
| 6 | SRDS [75] | 2014 | DSP | Without training dataset | One | Weighted Color Contrast | T | | Not available |
| 7 | GP [40] | 2015 | CVPRW | Without training dataset | Two | Markov Random Field, 4 Priors | T | ✓ | 0.670-0.824 |
| 8 | PRC [62] | 2016 | Access | Without training dataset | Two | Region Classification, RFR | T | | Not available |
| 9 | LBE [41] | 2016 | CVPR | Without training dataset | Two | Angular Density Component | T | ✓ | 0.736-0.890 |
| 10 | DCMC [55] | 2016 | SPL | Without training dataset | Two | Depth Confidence, Compactness, Graph | T | ✓ | 0.743-0.856 |
| 11 | SE [42] | 2016 | ICME | Without training dataset | Two | Cellular Automata | T | ✓ | 0.771-0.856 |
| 12 | MCLP [67] | 2017 | Cybernetic | Without training dataset | Two | Addition, Deletion and Iteration Scheme | T | ✓ | Not available |
| 13 | TPF [66] | 2017 | ICCVW | Without training dataset | Four | Cellular Automata, Optical Flow | T | ✓ | Not available |
| 14 | CDCP [46] | 2017 | ICCVW | Without training dataset | Two | Center-dark Channel Prior | T | ✓ | 0.700-0.820 |
| 15 | DF [44] | 2017 | TIP | NLR (0.75K) + NJU (1.0K) | Three | Laplacian Propagation, LGBS Priors | D | ✓ | 0.759-0.880 |
| 16 | BED [76] | 2017 | ICCVW | NLR (0.80K) + NJU (1.6K) + MK (9K) | Two | Background Enclosure Distribution | D | ✓ | Not available |
| 17 | MDSF [45] | 2017 | TIP | NLR (0.50K) + NJU (0.5K) | Two | SVM [74], RFR, Ultrametric Contour Map | T | | 0.779-0.885 |
| 18 | MFF [77] | 2017 | SPL | Without training dataset | One | Minimum Barrier Distance, 3D prior | T | | Not available |
| 19 | Review [56] | 2018 | TCSVT | Without training dataset | Two | Without model introduced | T | | Not available |
| 20 | HSCS [68] | 2018 | TMM | Without training dataset | Two | Hierarchical Sparsity, Energy Function | T | ✓ | Not available |
| 21 | ICS [69] | 2018 | TIP | Without training dataset | One | MCFM, CLP | T | ✓ | Not available |
| 22 | CDB [47] | 2018 | NC | Without training dataset | One | Background Prior | T | ✓ | 0.698-0.830 |
| 23 | SCDL [78] | 2018 | DSP | NLR (0.75K) + NJU (1.0K) | Two | Silhouette Feature, Spatial Coherence Loss | D | | Not available |
| 24 | PCF [49] | 2018 | CVPR | NLR (0.70K) + NJU (1.5K) | Three | Complementarity-Aware Fusion module [49] | D | | 0.827-0.925 |
| 25 | CTMF [43] | 2018 | Cybernetic | NLR (0.65K) + NJU (1.4K) | Four | HHA [79], IPT, Hidden Structure Transfer | D | | 0.829-0.932 |
| 26 | ACCF [80] | 2018 | IROS | NLR (0.65K) + NJU (1.4K) | Three | Attention-Aware | D | | Not available |
| 27 | PDNet [48] | 2019 | ICME | NLR (0.50K) + NJU (1.5K) + O (21K) | Five | Depth-Enhanced Net [48] | D | | Not available |
| 28 | AFNet [61] | 2019 | Access | NLR (0.70K) + NJU (1.5K) | Three | Switch map, Edge-Aware loss | D | | 0.807-0.887 |
| 29 | MMCI [81] | 2019 | PR | NLR (0.70K) + NJU (1.5K) | Three | HHA [79], Dilated Convolutional | D | | 0.839-0.928 |
| 30 | TANet [82] | 2019 | TIP | NLR (0.70K) + NJU (1.5K) | Three | Attention-Aware Multi-Modal Fusion | D | | 0.847-0.941 |
| 31 | CPFP [51] | 2019 | CVPR | NLR (0.70K) + NJU (1.5K) | Five | Contrast Prior, Fluid Pyramid | D | | 0.852-0.932 |
| 32 | D3Net (Ours) | 2020 | TNNLS | NLR (0.70K) + NJU (1.5K) | Seven | Depth Depurator Unit | D | | 0.862-0.953 |
| | Background Objects | | | | | | | | Object Boundary | | # Object | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SIP (Ours) | car | flower | grass | road | tree | signs | barrier | other | dark | clear | 1 | 2 | ≥3 |
| #Img | 107 | 9 | 154 | 140 | 97 | 25 | 366 | 32 | 162 | 767 | 591 | 159 | 179 |
| Dataset | Metric | LHM [39] | CDB [47] | DESM [38] | GP [40] | CDCP [46] | ACSD [37] | LBE [41] | DCMC [55] | MDSF [45] | SE [42] | DF [44]† | AFNet [61]† | CTMF [43]† | MMCI [81]† | PCF [49]† | TANet [82]† | CPFP [51]† | D3Net (Ours)† |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Time (s) | 2.130 | - | 7.790 | 12.98 | 60.0 | 0.718 | 3.110 | 1.200 | 60.0 | 1.570 | 10.36 | 0.030 | 0.630 | 0.050 | 0.060 | 0.070 | 0.170 | 0.015 |
| | Code | M | - | M | M&C | M&C | C | M&C | M | C | M&C | M&C | Tf | Caffe | Caffe | Caffe | Caffe | Caffe | PyTorch |
| NJU-T [37] | S-measure | .514 | .624 | .665 | .527 | .669 | .699 | .695 | .686 | .748 | .664 | .763 | .772 | .849 | .858 | .877 | .878 | .879 | .900 |
| | F-measure | .632 | .648 | .717 | .647 | .621 | .711 | .748 | .715 | .775 | .748 | .804 | .775 | .845 | .852 | .872 | .874 | .877 | .900 |
| | E-measure | .724 | .742 | .791 | .703 | .741 | .803 | .803 | .799 | .838 | .813 | .864 | .853 | .913 | .915 | .924 | .925 | .926 | .950 |
| | MAE | .205 | .203 | .283 | .211 | .180 | .202 | .153 | .172 | .157 | .169 | .141 | .100 | .085 | .079 | .059 | .060 | .053 | .041 |
| | Rank | 17 | 16 | 14 | 17 | 15 | 12 | 10 | 13 | 9 | 11 | 7 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| STERE [63] | S-measure | .562 | .615 | .642 | .588 | .713 | .692 | .660 | .731 | .728 | .708 | .757 | .825 | .848 | .873 | .875 | .871 | .879 | .899 |
| | F-measure | .683 | .717 | .700 | .671 | .664 | .669 | .633 | .740 | .719 | .755 | .757 | .823 | .831 | .863 | .860 | .861 | .874 | .891 |
| | E-measure | .771 | .823 | .811 | .743 | .786 | .806 | .787 | .819 | .809 | .846 | .847 | .887 | .912 | .927 | .925 | .923 | .925 | .938 |
| | MAE | .172 | .166 | .295 | .182 | .149 | .200 | .250 | .148 | .176 | .143 | .141 | .075 | .086 | .068 | .064 | .060 | .051 | .046 |
| | Rank | 16 | 12 | 14 | 18 | 13 | 15 | 17 | 10 | 11 | 9 | 8 | 7 | 6 | 3 | 4 | 5 | 2 | 1 |
| DES [38] | S-measure | .578 | .645 | .622 | .636 | .709 | .728 | .703 | .707 | .741 | .741 | .752 | .770 | .863 | .848 | .842 | .858 | .872 | .898 |
| | F-measure | .511 | .723 | .765 | .597 | .631 | .756 | .788 | .666 | .746 | .741 | .766 | .728 | .844 | .822 | .804 | .827 | .846 | .885 |
| | E-measure | .653 | .830 | .868 | .670 | .811 | .850 | .890 | .773 | .851 | .856 | .870 | .881 | .932 | .928 | .893 | .910 | .923 | .946 |
| | MAE | .114 | .100 | .299 | .168 | .115 | .169 | .208 | .111 | .122 | .090 | .093 | .068 | .055 | .065 | .049 | .046 | .038 | .031 |
| | Rank | 18 | 13 | 14 | 17 | 16 | 12 | 10 | 15 | 11 | 9 | 7 | 8 | 3 | 5 | 6 | 4 | 2 | 1 |
| NLR-T [39] | S-measure | .630 | .629 | .572 | .654 | .727 | .673 | .762 | .724 | .805 | .756 | .802 | .799 | .860 | .856 | .874 | .886 | .888 | .912 |
| | F-measure | .622 | .618 | .640 | .611 | .645 | .607 | .745 | .648 | .793 | .713 | .778 | .771 | .825 | .815 | .841 | .863 | .867 | .897 |
| | E-measure | .766 | .791 | .805 | .723 | .820 | .780 | .855 | .793 | .885 | .847 | .880 | .879 | .929 | .913 | .925 | .941 | .932 | .953 |
| | MAE | .108 | .114 | .312 | .146 | .112 | .179 | .081 | .117 | .095 | .091 | .085 | .058 | .056 | .059 | .044 | .041 | .036 | .030 |
| | Rank | 14 | 15 | 16 | 18 | 12 | 17 | 10 | 13 | 7 | 11 | 8 | 8 | 5 | 6 | 4 | 3 | 2 | 1 |
| SSD [66] | S-measure | .566 | .562 | .602 | .615 | .603 | .675 | .621 | .704 | .673 | .675 | .747 | .714 | .776 | .813 | .841 | .839 | .807 | .857 |
| | F-measure | .568 | .592 | .680 | .740 | .535 | .682 | .619 | .711 | .703 | .710 | .735 | .687 | .729 | .781 | .807 | .810 | .766 | .834 |
| | E-measure | .717 | .698 | .769 | .782 | .700 | .785 | .736 | .786 | .779 | .800 | .828 | .807 | .865 | .882 | .894 | .897 | .852 | .910 |
| | MAE | .195 | .196 | .308 | .180 | .214 | .203 | .278 | .169 | .192 | .165 | .142 | .118 | .099 | .082 | .062 | .063 | .082 | .058 |
| | Rank | 16 | 17 | 15 | 11 | 17 | 13 | 14 | 9 | 12 | 9 | 7 | 8 | 6 | 4 | 2 | 2 | 5 | 1 |
| LFSD [64] | S-measure | .553 | .515 | .716 | .635 | .712 | .727 | .729 | .753 | .694 | .692 | .783 | .738 | .788 | .787 | .786 | .801 | .828 | .825 |
| | F-measure | .708 | .677 | .762 | .783 | .702 | .763 | .722 | .817 | .779 | .786 | .813 | .744 | .787 | .771 | .775 | .796 | .826 | .810 |
| | E-measure | .763 | .766 | .811 | .824 | .780 | .829 | .797 | .856 | .819 | .832 | .857 | .815 | .857 | .839 | .827 | .847 | .872 | .862 |
| | MAE | .218 | .225 | .253 | .190 | .172 | .195 | .214 | .155 | .197 | .174 | .145 | .133 | .127 | .132 | .119 | .111 | .088 | .095 |
| | Rank | 17 | 18 | 16 | 12 | 15 | 11 | 14 | 6 | 13 | 9 | 5 | 10 | 4 | 7 | 8 | 3 | 1 | 2 |
| SIP (Ours) | S-measure | .511 | .557 | .616 | .588 | .595 | .732 | .727 | .683 | .717 | .628 | .653 | .720 | .716 | .833 | .842 | .835 | .850 | .860 |
| | F-measure | .574 | .620 | .669 | .687 | .505 | .763 | .751 | .618 | .698 | .661 | .657 | .712 | .694 | .818 | .838 | .830 | .851 | .861 |
| | E-measure | .716 | .737 | .770 | .768 | .721 | .838 | .853 | .743 | .798 | .771 | .759 | .819 | .829 | .897 | .901 | .895 | .903 | .909 |
| | MAE | .184 | .192 | .298 | .173 | .224 | .172 | .200 | .186 | .167 | .164 | .185 | .118 | .139 | .086 | .071 | .075 | .064 | .063 |
| | Rank | 17 | 16 | 14 | 12 | 18 | 6 | 9 | 14 | 10 | 11 | 13 | 7 | 8 | 5 | 3 | 4 | 2 | 1 |
| Overall | Rank | 18 | 17 | 15 | 14 | 16 | 13 | 12 | 11 | 10 | 9 | 7 | 8 | 6 | 5 | 4 | 3 | 2 | 1 |
| Aspects | Model | SIP (Ours) | STERE [63] | DES [38] | LFSD [64] | SSD [66] | NJU2K [37] | NLPR [39] |
|---|---|---|---|---|---|---|---|---|
| w/o DDU | RgbNet | 0.831 | 0.893 | 0.881 | 0.810 | 0.839 | 0.888 | 0.911 |
| | RgbdNet | 0.862 | 0.898 | 0.896 | 0.836 | 0.857 | 0.898 | 0.910 |
| | DepthNet | 0.862 | 0.713 | 0.911 | 0.724 | 0.811 | 0.857 | 0.864 |
| DDU | Lower Bound | 0.822 | 0.881 | 0.870 | 0.788 | 0.817 | 0.875 | 0.897 |
| | D3Net (Ours) | 0.860 | 0.899 | 0.898 | 0.825 | 0.857 | 0.900 | 0.912 |
| | Upper Bound | 0.872 | 0.910 | 0.907 | 0.858 | 0.879 | 0.912 | 0.924 |
Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks
Deng-Ping Fan, Zheng Lin, Zhao Zhang, Menglong Zhu, and Ming-Ming Cheng
D.-P. Fan, Z. Lin, Z. Zhang, and M.-M. Cheng are with the College of Computer Science, Nankai University, Tianjin, China. M. Zhu is with Google AI, USA. M.-M. Cheng is the corresponding author (email: [email protected]). Manuscript received July 16, 2019; revised March 10, 2020.
Abstract
The use of RGB-D information for salient object detection has been extensively explored in recent years. However, relatively few efforts have been put towards modelling salient object detection in real-world human activity scenes with RGB-D. In this work, we fill the gap by making the following contributions to RGB-D salient object detection. (1) We carefully collect a new SIP (salient person) dataset, which consists of ~1K high-resolution images that cover diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds. (2) We conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research. We systematically summarize 32 popular models and evaluate 18 of these 32 models on seven datasets containing a total of about 97K images. (3) We propose a simple general architecture, called Deep Depth-Depurator Network (D3Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which perform low-quality depth map filtering and cross-modal feature learning, respectively. These components form a nested structure and are elaborately designed to be learned jointly. D3Net exceeds the performance of any prior contenders across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D3Net can be used to efficiently extract salient object masks from real scenes, enabling an effective background-changing application at a speed of 65 fps on a single GPU. All the saliency maps, our new SIP dataset, the D3Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.
Index Terms:
Benchmark, SIP Dataset, Salient Object Detection, Saliency, RGB-D.
I Introduction
How to take high-quality photos has become one of the most important points of competition among mobile phone manufacturers. Salient object detection (SOD) methods [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] have been incorporated into mobile phones and are widely used for creating perfect portraits by automatically adding large-aperture and other enhancement effects. While existing SOD methods [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] have achieved remarkable success, most of them rely only on RGB images and ignore the important depth information, which is widely available in modern smartphones (e.g., iPhone X, Huawei Mate10, and Samsung Galaxy S10). Thus, fully utilizing RGB-D information for SOD has recently attracted significant research attention [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51].
One of the primary goals of existing smartphone cameras is to identify humans in visual scenes, through either coarse bounding-box-level or instance-level segmentation. To this end, intelligent solutions, such as RGB-D saliency detection techniques, have gained considerable attention.
However, most existing RGB-D based SOD methods are tested on RGB-D images taken by Kinect [52] or a light field camera [53], or estimated by optical flow [54], which have different characteristics from actual smartphone cameras. Since humans are the key subjects of photographs taken with smartphones, a human-oriented RGB-D dataset featuring realistic, in-the-wild images would be more useful for mobile manufacturers. Despite the effort of some authors [37, 39] to augment their scenes with additional objects, a human-centered RGB-D dataset for salient object detection does not yet exist.
Furthermore, although depth maps provide important complementary information for identifying salient objects, low-quality depth maps often cause wrong detections [55]. While existing RGB-D based SOD models typically fuse RGB and depth features using different strategies [51], no model in the RGB-D SOD field explicitly and automatically discards low-quality depth maps. We believe such models have a high potential for driving this field forward.
In addition to the limitations of current RGB-D datasets and models already mentioned, most RGB-D studies also suffer from several other common constraints, including:
Sufficiency. Only a limited number of datasets (1~4) have been benchmarked in recent papers [39, 56] (Table II). The generalizability of models cannot be properly assessed with such a small number of datasets.
Completeness. F-measure [57], MAE, and PR (precision & recall) Curve are the three most widely used metrics in existing works. However, as suggested by [58, 59], these metrics essentially operate at the pixel level. It is thus difficult to draw thorough and reliable conclusions from quantitative evaluations [60].
Fairness. Some works [51, 61, 49] use the same F-measure metric, but do not explicitly describe which statistic (e.g., mean or max) was used, easily resulting in unfair comparisons and inconsistent performance. Meanwhile, the different threshold strategies for F-measure (e.g., 255 varied thresholds [61, 51, 62], adaptive saliency threshold [39, 41], and self-adaptive threshold [43]) result in different performance. It is thus crucial to provide a fair comparison of RGB-D based SOD models by extensively evaluating them with the same metrics on a standard leaderboard.
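To make the fairness concern concrete, the following minimal sketch scores one toy prediction with three F-measure conventions: the maximum over all integer thresholds, the mean over those thresholds, and a single adaptive threshold. The function names, the weight β² = 0.3, and the adaptive threshold of twice the mean saliency are assumptions borrowed from common SOD practice; they are not taken from this paper.

```python
def f_measure(pred, gt, threshold, beta2=0.3):
    """F-measure of a saliency map binarized at `threshold` (beta2 is an assumed weight)."""
    tp = sum(1 for p, g in zip(pred, gt) if p >= threshold and g)
    if tp == 0:
        return 0.0
    precision = tp / sum(1 for p in pred if p >= threshold)
    recall = tp / sum(1 for g in gt if g)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


def f_measure_variants(pred, gt):
    """pred: flat list of saliency values in [0, 255]; gt: flat list of 0/1 labels."""
    scores = [f_measure(pred, gt, t) for t in range(256)]
    # Assumed adaptive rule: twice the mean saliency, capped at 255.
    adaptive_t = min(2 * sum(pred) / len(pred), 255)
    return {
        "max": max(scores),
        "mean": sum(scores) / len(scores),
        "adaptive": f_measure(pred, gt, adaptive_t),
    }


pred = [250, 200, 120, 40, 10, 0]   # toy saliency values
gt = [1, 1, 0, 0, 0, 0]             # toy ground truth
print(f_measure_variants(pred, gt))
```

The same prediction receives three different scores, which is why a comparison is only meaningful when the statistic and threshold strategy are stated.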
I-A Contribution
To address the above-mentioned problems, we provide three distinct contributions.
(1) We have built a new Salient Person (SIP) dataset (see Fig. 2, Fig. 3). It consists of 929 accurately annotated high-resolution images which are designed to contain multiple salient persons per image. It is worth mentioning that the depth maps are captured by a real smartphone. We believe such a dataset is highly valuable and will facilitate the application of RGB-D models to mobile devices. In addition, the dataset is carefully designed to cover diverse scenes and various challenging situations (e.g., occlusion, appearance change), and is elaborately annotated with pixel-level ground truths (GT). Another distinguishing feature of our SIP dataset is the availability of both RGB and grayscale images captured by a binocular camera, which can benefit a broad range of research directions, such as stereo matching, depth estimation, and human-centered detection.
(2) With the proposed SIP and six existing RGB-D datasets [37, 63, 38, 39, 66, 64], we provide a more comprehensive comparison of 32 classical RGB-D salient object detection models and present the large-scale (97K images) fair evaluation of 18 state-of-the-art (SOTA) algorithms [39, 38, 37, 40, 41, 55, 42, 46, 44, 45, 67, 68, 69, 47, 49, 43], making our study a good all-around RGB-D benchmark. To further promote the development of this field, we additionally provide an online evaluation platform with the preserved test set.
(3) We propose a simple general model called Deep Depth-Depurator Network (D3Net), which learns to automatically discard low-quality depth maps using a novel depth depurator unit (DDU). Thanks to the gate connection mechanism, our D3Net can predict salient objects accurately. Extensive experiments demonstrate that our D3Net remarkably outperforms prior work on many challenging datasets. Such a general framework design helps to learn cross-modality features from RGB images and depth maps.
Our contributions offer a systematic benchmark equipped with the basic tools for comprehensive assessment of RGB-D models, offering deep insight into the task of RGB-D based modelling and encouraging future research in this direction.
I-B Organization
In Section II, we first review current datasets for RGB-D salient object detection, as well as representative models for this task. Then, we present details on the proposed salient person dataset SIP in Section III. In Section IV, we describe our D3Net model for RGB-D salient object detection, which explicitly filters out low-quality depth maps.
In Section V, we provide both a quantitative and qualitative experimental analysis of the proposed algorithm. Specifically, in Section V-A, we offer more details on our experimental settings, including the benchmarked models, datasets and runtime. In Section V-B, five evaluation metrics (E-measure [59], S-measure [58], MAE, PR Curve, and F-measure [57]) are described in detail. In Section V-C, we provide the mean statistics over different datasets and summarize them in Table IV. Comparison results of 18 SOTA RGB-D based SOD models over seven datasets, namely STERE [63], LFSD [64], DES [38], NLPR [39], NJU2K [37], SSD [66], and SIP (Ours), clearly demonstrate the robustness and efficiency of our D3Net model. Further, in Section V-D, we provide a performance comparison between traditional and deep models. We also discuss the experimental results in more depth. In Section V-E, we provide visualizations of the results and present saliency maps generated for various challenging scenes. In Section VI, we discuss some potential applications related to human activities and provide an interesting and realistic use scenario of D3Net in a background-changing application. To better understand the contributions of the DDU in the proposed D3Net, in Section VII, we present the upper and lower bounds of the DDU. All in all, the extensive experimental results clearly demonstrate that our D3Net model exceeds the performance of any prior competitors across five different metrics. In Section VII-B, we discuss the limitations of this work. Finally, Section VIII concludes the paper.
II Related Works
II-A RGB-D Datasets
Over the past few years, several RGB-D datasets have been constructed for SOD. Some statistics of these datasets are shown in Table I. Specifically, the STERE [63] dataset was the first collection of stereoscopic photos in this field. GIT [36], LFSD [64] and DES [38] are three small-sized datasets. GIT and LFSD were designed with specific purposes in mind, e.g., saliency-based segmentation of generic objects, and saliency detection on the light field. DES has 135 indoor images captured by Microsoft Kinect [52]. Although these datasets have advanced the field to various degrees, they are severely restricted by their small scale or low resolution. To overcome these barriers, Peng *et al.* created NLPR [39], a large-scale RGB-D dataset with a resolution of 640×480. Later, Ju *et al.* built NJU2K [37], which has become one of the most popular RGB-D datasets. The recent SSD [66] dataset partially remedied the resolution restriction of NLPR and NJU2K. However, it only contains 80 images. Despite the progress made by existing RGB-D datasets, they still suffer from the common limitation that their depth maps are not captured with real smartphones, making them unsuitable for reflecting real environmental conditions (e.g., lighting or distance to object).
Compared to previous datasets, the proposed SIP dataset has three fundamental differences:
- It includes 929 images with many challenging situations [83] (e.g., dark background, occlusion, appearance change, and out-of-view) from various outdoor scenarios.
- The RGB images, grayscale images, and estimated depth maps are captured by a smartphone with a dual camera. Because SOD on mobile phones is predominantly applied to human subjects, we also focus on this and thus, for the first time, emphasize the salient persons in real-world scenes.
- A detailed quantitative analysis is presented for the quality of the dataset (e.g., center bias, object size distribution, etc.), which was not carefully investigated in previous RGB-D based studies.
II-B RGB-D Models
Traditional models rely heavily on hand-crafted features (e.g., contrast [73, 39, 38, 75], shape [36]). By embedding classical principles (e.g., spatial bias [38], center-dark channel [46], 3D [77], and background [47, 40] priors), difference of Gaussian [37], region classification [62], SVM [73, 45], graph knowledge [55], cellular automata [42], and Markov random fields [75, 40], these models show that specific hand-crafted features can lead to decent performance. Several studies have also explored methods of integrating RGB and depth features via various combination strategies, using, for instance, angular densities [41], random forest regressors [62, 45], and minimum barrier distances [77]. More details are shown in Table II.
To overcome the limited expression ability of hand-crafted features, recent works [76, 44, 78, 43, 48, 49, 80, 61, 81, 82, 51] have proposed to introduce CNNs to infer salient objects from RGB-D data. BED [76] and DF [44] are two pioneering works that introduced deep learning technology into the RGB-D based SOD task. More recently, Huang *et al.* developed a more efficient end-to-end model [78] with a modified loss function. To address the shortage of training data, Zhu *et al.* [48] presented a robust prior model with a guided depth-enhancement module for SOD. In addition, Chen *et al.* developed a series of novel approaches for this field, such as hidden structure transfer [43], a complementarity fusion module [49], an attention-aware component [80, 82], and dilated convolutions [81]. Nevertheless, these works, to the best of our knowledge, are dedicated to extracting general depth features/information.
We argue that not all information in a depth map is informative for SOD, and low-quality depth maps often introduce significant noise (see Fig. 1). Thus, we instead design a simple general framework, D3Net, which is equipped with a depth depurator unit to explicitly exclude low-quality depth maps when learning complementary features.
III Proposed Dataset
III-A Dataset Overview
We introduce SIP, the first human-activity-oriented salient person detection dataset. Our dataset contains 929 RGB-D images belonging to eight different background scenes, under two different object boundary conditions, which portray multiple actors. Each actor wears different clothes in different images. Following [83], the images are carefully selected to cover diverse challenging cases (e.g., appearance change, occlusion, and shape complexity). Examples can be found in Fig. 2 and Fig. 3. The overall dataset can be downloaded from our website http://dpfan.net/SIPDataset/.
III-B Sensors and Data Acquisition
Image Collection: We used a Huawei Mate 10 to collect our images. The Mate 10’s rear cameras feature high-grade Leica SUMMILUX-H lenses with bright f/1.6 apertures and combine 12MP RGB and 20MP Monochrome (grayscale) sensors. The depth map is automatically estimated by the Mate10. We asked nine people, all dressed in different colors, to perform specific actions in real-world daily scenes. Instructions on how to perform the action to cover different challenging situations (e.g., occlusion, out-of-view) were given, but no instructions on style, angle, or speed were provided, in order to record realistic data.
Data Annotation: After capturing 5,269 images and the corresponding depth maps, we first manually selected about 2,500 images, each of which included one or multiple salient people. Following many famous SOD datasets [57, 84, 70, 85, 19, 86, 87, 88, 71, 89, 90], six viewers were further instructed to draw bounding boxes (bboxes) around the most attention-grabbing person, according to their first instinct. We adopted the voting scheme described in [39] to discard images with low voting consistency and chose the top 1,000 most satisfactory images. Another five annotators were then introduced to label accurate silhouettes of the salient objects according to the bboxes. We discarded some images with low-quality annotations and finally obtained the 929 images with high-quality ground-truth annotations.
III-C Dataset Statistics
Center Bias: Center bias has been identified as one of the most significant biases of saliency detection datasets [91]. It occurs because subjects tend to look at the center of a screen [92]. As noted in [83], simply overlapping all of the maps in the dataset cannot well describe the degree of center bias.
Following [83], we present the statistics of two distances in Fig. 4 (a & b): how far the object center is from the image center, and how far the margin (farthest) point of the object is from the image center. The center biases of our SIP and the existing datasets [63, 36, 64, 38, 39, 37, 66] are compared in the same figure. Except for our SIP and two small-scale datasets (GIT and SSD), most datasets present a high degree of center bias, i.e., the center of the object is close to the image center.
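As an illustration of these two center-bias distances, the sketch below computes them from a binary object mask. The function name and the normalization by the center-to-corner distance are assumptions made for this example; the normalization used for Fig. 4 is not specified here.

```python
import math


def center_bias_distances(mask):
    """mask: 2D list of 0/1 values. Returns (object-center distance, farthest-point distance)."""
    h, w = len(mask), len(mask[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    pixels = [(y, x) for y in range(h) for x in range(w) if mask[y][x]]
    if not pixels:
        raise ValueError("mask contains no salient pixels")
    # Object center = mean coordinate of the salient pixels.
    oy = sum(y for y, _ in pixels) / len(pixels)
    ox = sum(x for _, x in pixels) / len(pixels)
    # Assumed normalization: distance from the image center to a corner.
    norm = math.hypot(cy, cx) or 1.0
    d_center = math.hypot(oy - cy, ox - cx) / norm
    d_margin = max(math.hypot(y - cy, x - cx) for y, x in pixels) / norm
    return d_center, d_margin


mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1          # a single salient pixel at the image center
print(center_bias_distances(mask))
```

Under this normalization both values lie in [0, 1], and a dataset with strong center bias concentrates the first value near 0.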
Size of Objects: We define object size as the ratio of salient object pixels to the total number of pixels in the image. The normalized object size in SIP (Fig. 4 (c)) ranges from 0.48% to 66.85% (avg.: 20.43%).
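The object-size definition above translates directly into code. The following is a minimal sketch; the function name and the toy mask are illustrative only.

```python
def object_size_ratio(mask):
    """Ratio of salient (non-zero) pixels to all pixels in a 2D 0/1 mask."""
    total = sum(len(row) for row in mask)
    salient = sum(1 for row in mask for value in row if value)
    return salient / total


# Toy 4x5 mask with a 2x2 salient region.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(f"{object_size_ratio(mask):.2%}")
```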
Background Objects: As summarized in Table III, SIP includes diverse background objects (e.g., cars, trees, and grass). Models tested on such a dataset would likely be able to handle realistic scenes better and thus be more practical.
Object Boundary Conditions: In Table III, we show different object boundary conditions (e.g., dark and clear) in our SIP dataset. One example of a dark condition, which often occurs in daily scenes, can be found in Fig. 3. The depth maps obtained in low-light conditions inevitably introduce more challenges for detecting salient objects.
Number of Salient Objects: From Table I, we note that existing datasets fall short in their numbers of salient objects (e.g., they often only have one). Previous studies [93], however, have shown that humans can accurately enumerate up to at least five objects without counting. Thus, our SIP is designed to contain up to five salient objects per image. The statistics of labelled objects in each image are shown in Table III (# Object).
IV Proposed Model
Following the motivation described in Fig. 1, both cross-modality feature extraction and a depth filtering unit are highly desirable. We therefore propose the simple, general D3Net model (illustrated in Fig. 5), which contains two components, i.e., a three-stream feature learning module (Section IV-A) and a depth depurator unit (Section IV-B). The FLM (feature learning module) extracts features from the different modalities, while the DDU (depth depurator unit) acts as a gate that explicitly filters out low-quality depth maps. If the DDU decides to filter out a depth map, the data flows through the RgbNet instead. These components form a nested structure, and are elaborately designed to achieve robust performance and high generalization ability on various challenging datasets.
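The gating behaviour described above can be sketched as a simple routing step. This is only an illustration: the helper names, the quality test (mean absolute difference between the RgbdNet and DepthNet predictions), and the threshold value of 0.15 are hypothetical choices for this example and should not be read as the paper's exact criterion.

```python
def looks_low_quality(s_rgbd, s_depth, threshold=0.15):
    """Hypothetical quality test: large disagreement between two predictions (2D lists in [0, 1])."""
    diffs = [abs(a - b) for row_a, row_b in zip(s_rgbd, s_depth)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs) > threshold


def select_prediction(s_rgb, s_rgbd, depth_is_low_quality):
    """Gate: fall back to the RGB-only prediction when the depth map is rejected."""
    return s_rgb if depth_is_low_quality else s_rgbd


s_rgb = [[0.9, 0.1]]
s_rgbd = [[0.8, 0.2]]
s_depth_bad = [[0.1, 0.9]]   # disagrees strongly with s_rgbd
chosen = select_prediction(s_rgb, s_rgbd, looks_low_quality(s_rgbd, s_depth_bad))
print(chosen is s_rgb)
```

Keeping the decision separate from the routing means a different quality test can be swapped in without touching the gate itself.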
IV-A Feature Learning Module
Most existing models [94, 95, 96] have shown significant improvement for object detectors in several applications. These models typically share a common structure, the Feature Pyramid Network (FPN) [97]. Motivated by this, we introduce an FPN-like component into our D3Net baseline to efficiently extract features in a pyramid manner. The D3Net pipeline differs between the training phase and the test phase because the DDU is used only in the test phase.
As shown in Fig. 5, the FLM appears in both the training and test phases. It consists of three sub-networks, i.e., RgbNet, RgbdNet, and DepthNet. Note that the three sub-networks share the same structure but are fed with different input channels. Specifically, each sub-network receives an image re-scaled to 224×224 resolution. The goal of the FLM is to obtain the corresponding predicted map for each sub-network.
As in [97], we use bottom-up and top-down pathways together with lateral connections to extract features. The outputs are then proportionally organized at multiple levels. The FPN is independent of the backbone; for simplicity, we adopt the VGG-16 [98] architecture as our basic convolutional network to extract spatial features, although a more powerful backbone [99] could be explored in the future. Some studies, such as [100], have shown that deeper layers retain more semantic information for locating objects. Based on this observation, we add a layer containing two 3×3 convolution kernels on top of the five-stage VGG-16 structure.
Fig. 6 shows how our top-down features are built. For a specific coarser layer, we first conduct 2× upsampling using the nearest-neighbor operation. The upsampled feature is then concatenated with the finer feature map to obtain rich features. Before being concatenated with the coarse map, the finer map undergoes a 1×1 Conv operation to reduce its channels. For example, consider the four-dimensional feature tensor that forms the input of RgbdNet. We define a set of anchors on different layers so that we obtain a set of eleven pyramid feature tensors, one for each layer $i \in [1, 11]$. Note that five of these tensors correspond to the five convolutional stages of VGG-16.
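The top-down merge step described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' PyTorch implementation: feature maps are nested lists indexed `[channel][row][col]`, and the function names (`upsample2x_nearest`, `conv1x1`, `merge_top_down`) and the example weights are hypothetical.

```python
def upsample2x_nearest(fmap):
    """2x nearest-neighbour upsampling of a [C][H][W] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def conv1x1(fmap, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wo[c] * fmap[c][y][x] for c in range(len(fmap)))
          for x in range(w)] for y in range(h)]
        for wo in weights
    ]


def merge_top_down(coarse, fine, weights):
    """Upsample the coarse map, reduce the fine map's channels, concatenate."""
    up = upsample2x_nearest(coarse)
    reduced = conv1x1(fine, weights)
    # Both maps must share the same spatial size before concatenation.
    assert len(up[0]) == len(reduced[0]) and len(up[0][0]) == len(reduced[0][0])
    return up + reduced  # concatenation along the channel axis


coarse = [[[1.0]]]                     # 1 channel, 1x1 (coarser level)
fine = [[[1.0, 2.0], [3.0, 4.0]],      # 2 channels, 2x2 (finer level)
        [[0.0, 1.0], [1.0, 0.0]]]
merged = merge_top_down(coarse, fine, weights=[[0.5, 0.5]])  # illustrative weights
```

In a real network the 1×1 weights are learned; here they are fixed only to make the data flow visible.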
IV-B Depth Depurator Unit (DDU)
In the test phase, we further adopt a new gate connection strategy to obtain the optimal predicted map. Low-quality depth maps introduce more noise than informative cues to the prediction. The goal of gate connection is to classify depth maps into reasonable and low-quality ones and not use the poor ones in the pipeline.
As illustrated in Fig. 7 (b), a stand-alone salient object in a high-quality depth map is typically characterized by well-defined closed boundaries and shows clear double peaks in its depth distribution. The statistics of the depth maps in existing datasets [63, 64, 38, 39, 37, 66] also support the fact that “high quality depth maps usually contain clear objects, while the elements in low-quality depth maps are cluttered (2nd row in Fig. 7)”. In order to reject the low-quality depth maps, we propose DDU as follows:
More specifically, in the test phase, the RGB image and depth map are first resized to a fixed size (e.g., 224×224, the same as in the training phase) to reduce the computational complexity. As shown in Fig. 5 (right), the DDU is implemented with a gate connection. Denote the three predicted maps for an input image as $\{P_{rgb}, P_{rgbd}, P_{depth}\}$, produced by RgbNet, RgbdNet, and DepthNet, respectively; the goal of the DDU is then to decide which predicted map is optimal:

$$P = F_{ddu}(P_{rgb}, P_{rgbd}, P_{depth}) \quad (1)$$

Intuitively, there are two ways to achieve this goal, i.e., post-processing and pre-processing. We propose a simple but general post-processing scheme for the DDU, which is applied in the test phase rather than in the training phase. Specifically, a comparison unit is leveraged to assess the similarity between the maps $P_{depth}$ and $P_{rgbd}$ generated by DepthNet and RgbdNet, respectively:

$$F_{cu} = \begin{cases} 1, & \delta(P_{rgbd}, P_{depth}) \le t \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where $\delta(\cdot)$ represents a distance function and $t$ indicates a fixed threshold. Note that the comparison unit acts as an index to decide which sub-network (RgbNet or RgbdNet) should be utilized.
The comparison unit is the key to our DDU. We utilize it as a gate connection to decide the final/optimal predicted map P. Thus, our module can be formulated as:

$$P = F_{cu} \cdot P_{rgbd} + \bar{F}_{cu} \cdot P_{rgb} \quad (3)$$

where $\bar{F}_{cu} = 1 - F_{cu}$. The $F_{cu}$ can be viewed as a fixed weight. A more elegant formulation (adaptive weight) is left for future work.
IV-C Implementation Details
DDU. The key component of our D3Net is the DDU. In this work, we use a simple yet powerful distance function in Eq. 2: we leverage the mean absolute error (MAE) metric (the same as Eq. 5) to assess the distance between two maps. The basic motivation is that if a high-quality depth map contains clear objects, DepthNet will easily detect these objects (see the first row in Fig. 7). The higher the quality of the depth map, the greater the similarity between the predicted maps of DepthNet and RgbdNet. In other words, the predicted map from RgbdNet has taken the depth features into account. If the quality of the depth map is low, the predicted map from RgbdNet will be quite different from the map generated by DepthNet. We tested a set of values for the fixed threshold in Eq. 2 (0.01, 0.02, 0.05, 0.10, 0.15, and 0.20) and kept the one that achieves the best performance.
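The gate described above can be sketched in a few lines. This is a hedged illustration rather than the released implementation: the function names are hypothetical, maps are plain nested lists with values in [0, 1], the comparison is assumed to be "distance less than or equal to the threshold", and the threshold used in the example is simply one of the tested values, not necessarily the selected one.

```python
def mae(a, b):
    """Mean absolute error between two equally sized 2-D maps."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def ddu_select(p_rgb, p_rgbd, p_depth, t):
    """Return (chosen_map, used_depth).

    If the RgbdNet and DepthNet predictions agree (MAE <= t, an assumed
    convention), the depth map is treated as reliable and the RgbdNet
    output is kept; otherwise the gate falls back to the RgbNet output.
    """
    depth_ok = mae(p_rgbd, p_depth) <= t
    return (p_rgbd, True) if depth_ok else (p_rgb, False)


# Toy 1x2 maps; t = 0.15 is illustrative only.
p_rgb = [[0.0, 1.0]]
p_rgbd = [[0.1, 0.9]]
good_depth_pred = [[0.0, 1.0]]   # close to the RgbdNet prediction
bad_depth_pred = [[1.0, 0.0]]    # far from the RgbdNet prediction

kept, used_depth = ddu_select(p_rgb, p_rgbd, good_depth_pred, t=0.15)
fallback, used_depth_bad = ddu_select(p_rgb, p_rgbd, bad_depth_pred, t=0.15)
```

Because the gate only compares two already-computed predictions, it adds almost no cost at test time.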
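Loss Function. We adopt the widely-used cross entropy loss function to train our model:

$$L(S, G) = -\frac{1}{N}\sum_{i=1}^{N}\Big(g_i \log s_i + (1 - g_i)\log(1 - s_i)\Big) \quad (4)$$

where $S$ and $G$ indicate the estimated saliency map (i.e., the output of RgbNet, RgbdNet, or DepthNet) and the GT map, respectively; $g_i \in G$, $s_i \in S$, and $N$ denotes the total number of pixels.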
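As a small worked example of the cross-entropy loss, the sketch below computes it over flattened maps. The function name `bce_loss` and the clamping constant `eps` are assumptions added for numerical safety; they are not taken from the paper.

```python
import math


def bce_loss(s, g, eps=1e-7):
    """Mean binary cross-entropy between predictions s and labels g.

    s: flat list of predicted saliency values in [0, 1]
    g: flat list of ground-truth labels (0 or 1)
    eps: assumed clamp so that log(0) is never evaluated
    """
    total = 0.0
    for si, gi in zip(s, g):
        si = min(max(si, eps), 1.0 - eps)
        total += gi * math.log(si) + (1 - gi) * math.log(1.0 - si)
    return -total / len(s)


uninformative = bce_loss([0.5, 0.5], [1, 0])   # equals log(2)
confident = bce_loss([0.9, 0.1], [1, 0])
hesitant = bce_loss([0.6, 0.4], [1, 0])
```

A prediction of 0.5 everywhere gives a loss of log 2, and the loss shrinks as predictions move toward the correct labels.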
Training Settings. For fair comparison, we follow the same training settings described in [51]. We select 1,485 image pairs from the NJU2K [37] dataset and 700 image pairs from the NLPR [39] dataset as the training data (please refer to our website for Trainlist.txt). The proposed D3Net is implemented in Python with the PyTorch toolbox. We adopt Adam as the optimizer, with an initial learning rate of 1e-4 and a batch size of 8. Training runs for 30 epochs in total on a GTX TITAN X GPU with 12 GB of memory.
Data Augmentation. Due to the limited scale of existing datasets, we augment the training samples by flipping the images horizontally to reduce the risk of overfitting.
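A minimal sketch of horizontal-flip augmentation is given below. It assumes that the RGB image, the depth map, and the ground-truth mask of a sample are flipped together so that they stay aligned; the text only states that images are flipped, so this pairing and the helper names are assumptions.

```python
def hflip(image):
    """Mirror a 2-D map (list of rows) left to right."""
    return [row[::-1] for row in image]


def augment_with_flips(samples):
    """Given (rgb, depth, gt) triples, append a flipped copy of each.

    All three maps of a sample are flipped consistently (assumed), so the
    mask still matches the image content after augmentation.
    """
    flipped = [(hflip(rgb), hflip(depth), hflip(gt)) for rgb, depth, gt in samples]
    return samples + flipped


sample = (
    [[(255, 0, 0), (0, 0, 255)]],  # 1x2 RGB image, pixels as tuples
    [[0.2, 0.8]],                  # 1x2 depth map
    [[0, 1]],                      # 1x2 ground-truth mask
)
augmented = augment_with_flips([sample])
```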
V Benchmarking Evaluation Results
We benchmark about 97K images (5,398 images × 18 models) in this study, making it the largest and most comprehensive RGB-D based SOD benchmark to date.
V-A Experimental Settings
Models. We benchmark 18 SOTA models (see Table IV), including 10 traditional and 8 CNN based models.
Datasets. We conduct our experiments on seven datasets (see Table IV). The test sets of NJU2K [37] and NLPR [39] datasets, and the whole STERE [63], DES [38], SSD [66], LFSD [64], and SIP datasets are used for testing.
Runtime. In Table IV, we summarize the runtime of existing approaches. The timings are tested on the same platform: Intel Xeon(R) E5-2676v3 2.4GHz×24 and GTX TITAN X. Since the codes of [43, 80, 81, 82, 49, 47, 68, 69, 67] have not been released, their timings are borrowed from the original papers or provided by the authors. Our D3Net does not apply post-processing (e.g., CRF), so the computation takes only about 0.015s per image.
V-B Evaluation Metrics
MAE. We follow Perazzi et al. [101] and evaluate the mean absolute error (MAE) between a real-valued saliency map $S$ and a binary ground truth $G$ over all image pixels:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |s_i - g_i| \quad (5)$$

where $s_i$ and $g_i$ are the values of $S$ and $G$ at pixel $i$, and $N$ is the total number of pixels. The MAE estimates the degree of approximation between the saliency map and the ground truth map, and it is normalized to $[0, 1]$. The MAE provides a direct estimate of conformity between estimated and ground truth maps. However, for the MAE metric, small objects are naturally assigned smaller errors, while larger objects are given larger errors. The metric is also unable to tell where the error occurs [102].
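The size bias of the MAE noted above can be seen in a short sketch. It scores a completely blank prediction against a ground truth containing one square object; the helper names and the toy image sizes are illustrative assumptions.

```python
def mae(saliency, gt):
    """Mean absolute error between a saliency map and a binary GT map."""
    total = sum(abs(s - g) for rs, rg in zip(saliency, gt) for s, g in zip(rs, rg))
    return total / (len(gt) * len(gt[0]))


def blank_prediction_mae(h, w, obj_h, obj_w):
    """MAE of an all-zero prediction when the GT holds an obj_h x obj_w object."""
    gt = [[1 if (y < obj_h and x < obj_w) else 0 for x in range(w)] for y in range(h)]
    pred = [[0.0] * w for _ in range(h)]
    return mae(pred, gt)


small_object_error = blank_prediction_mae(10, 10, 1, 1)  # object covers 1% of pixels
large_object_error = blank_prediction_mae(10, 10, 5, 5)  # object covers 25% of pixels
```

Missing the object entirely costs only 0.01 for the small object but 0.25 for the large one, which is why the MAE is reported together with other metrics.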
PR Curve. We also follow Borji et al. [5] and provide the PR Curve. We binarize a saliency map using a fixed threshold that varies from 0 to 255. For each threshold, a pair of recall and precision scores is computed, and the pairs are then combined to form a precision-recall curve that describes the model performance in different situations. The overall evaluation results for PR Curves are shown in Fig. 8 (Top) and Fig. 9 (Left).
F-measure. The F-measure is essentially a region-based similarity metric. Following the works by Cheng and Zhang et al. [103, 5], we also provide the max F-measure using various fixed (0-255) thresholds. The overall F-measure evaluation results under different thresholds on each dataset are shown in Fig. 8 (Bottom) and Fig. 9 (Right).
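The threshold sweep behind the PR curve and the max F-measure can be sketched as follows. Several conventions here are assumptions rather than details given in the text: pixels are binarized with `>=`, precision is set to 1 when nothing is predicted, and `beta2 = 0.3` is used as the weighting that is common in the salient object detection literature.

```python
def pr_curve(saliency, gt, thresholds=range(256)):
    """Precision/recall pairs for each threshold.

    saliency: flat list of integer saliency values in [0, 255]
    gt: flat list of binary labels (0 or 1)
    """
    positives = sum(gt)
    curve = []
    for t in thresholds:
        binary = [1 if s >= t else 0 for s in saliency]   # assumed convention
        tp = sum(b & g for b, g in zip(binary, gt))
        predicted = sum(binary)
        precision = tp / predicted if predicted else 1.0  # assumed convention
        recall = tp / positives if positives else 0.0
        curve.append((precision, recall))
    return curve


def max_f_measure(curve, beta2=0.3):
    """Largest F-measure over all (precision, recall) pairs on the curve."""
    best = 0.0
    for p, r in curve:
        denom = beta2 * p + r
        if denom > 0:
            best = max(best, (1 + beta2) * p * r / denom)
    return best


saliency = [200, 180, 50, 10]   # toy 4-pixel saliency map
gt = [1, 1, 0, 0]
curve = pr_curve(saliency, gt)
best_f = max_f_measure(curve)
```

In this toy case any threshold between 51 and 180 separates the object perfectly, so the max F-measure reaches 1.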
S-measure. Both the MAE and F-measure metrics ignore important structural information. However, behavioral vision studies have shown that the human visual system is highly sensitive to structures in scenes [58]. Thus, we additionally include the structure measure (S-measure [58]). The S-measure combines the region-aware ($S_r$) and object-aware ($S_o$) structural similarity as the final structure metric:

$$S_\alpha = \alpha \cdot S_o + (1 - \alpha) \cdot S_r \quad (6)$$

where $\alpha \in [0, 1]$ is the balance parameter and is set to 0.5.
E-measure. The E-measure is the recently proposed Enhanced-alignment measure [59] from the binary map evaluation field. This measure is based on cognitive vision studies and combines local pixel values with the image-level mean value in one term, jointly capturing image-level statistics and local pixel matching information. Here, we report the maximal E-measure to provide a more comprehensive evaluation.
V-C Metric Statistics
For a given metric $\zeta$ we consider different statistics. Let $I_i^j$ denote the $j$-th image from a specific dataset $D_i$, so that $D_i = \{I_i^1, I_i^2, \ldots\}$. Let $\zeta(I_i^j)$ be the metric score on image $I_i^j$. The mean is the average dataset statistic, defined as $M_\zeta(D_i) = \frac{1}{|D_i|}\sum_{j}\zeta(I_i^j)$, where $|D_i|$ is the total number of images in the dataset. The mean statistics over different datasets are summarized in Table IV.
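The dataset-level mean is simply the average of the per-image scores. The sketch below illustrates it with made-up scores; the dataset names and values are placeholders, not results from the benchmark.

```python
def dataset_means(scores_by_dataset):
    """Average the per-image metric scores of each dataset.

    scores_by_dataset: dict mapping a dataset name to a list of per-image
    scores for one metric. Empty datasets are skipped.
    """
    return {
        name: sum(scores) / len(scores)
        for name, scores in scores_by_dataset.items()
        if scores
    }


# Placeholder scores for two hypothetical datasets.
means = dataset_means({"A": [0.5, 1.0], "B": [0.25]})
```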
V-D Performance Comparison and Analysis
Performance of Traditional Models. Based on the overall performances listed in Table IV, we observe that “SE [42], MDSF [45], and DCMC [55] are the top-3 traditional algorithms.” Utilizing superpixel technology, both SE and DCMC explicitly extract the region contrast features from an RGB image. In contrast, MDSF formulates SOD as a pixel-wise binary labelling problem, which is solved by SVM.
Performance of Deep Models. Our D3Net, CPFP [51] and TANet [82] are the top-3 deep models out of all leading methods, showing the strong feature representation ability of deep learning for this task.
Traditional vs. Deep Models. From Table IV, we observe that most of the deep models perform better than the traditional algorithms. Interestingly, MDSF [45] outperforms two deep models (i.e., DF [44] and AFNet [61]) on the NLPR dataset.
V-E Comparison with SOTAs
We compare our D3Net with 17 SOTA models in Table IV. In general, our model outperforms the best published result (CPFP [51], CVPR'19) by large margins of 1.0% to 5.8% on six datasets. Notably, we also achieve a significant improvement of 1.4% on the proposed real-world SIP dataset.
We also report saliency maps generated on various challenging scenes to show the visual superiority of our D3Net. Some representative examples are shown in Fig. 10, including cases where the structure of the salient object in the depth map is partially or dramatically damaged. In some rows, the depth of the salient object is locally connected with background scenes; another row contains multiple isolated salient objects. In these challenging situations, most of the existing top competitors are unlikely to locate the salient objects, owing to poor depth maps or insufficient multi-modal fusion schemes. Although CPFP [51], TANet [82], and PCF [49] generate more accurate saliency maps than the others, their results often include noticeable distinct background regions or lose the fine details of the salient object, due to the lack of a cross-modality learning ability. In contrast, our D3Net can eliminate low-quality depth maps and adaptively select complementary cues from RGB and depth images to infer the real salient object and highlight its details.
VI Applications
VI-A Human Activities
Nowadays, mobile phones generally have depth-sensing cameras. With RGB-D salient object detection, users can better achieve functions such as object extraction, a bokeh effect, and mobile user recognition. Many monitoring probes also have depth sensors, and RGB-D SOD can help in discovering suspicious objects. As another example, autonomous vehicles carry a lidar probe designed to obtain depth information, so RGB-D SOD is helpful for detecting basic objects such as pedestrians and signboards. Most industrial robots also have depth sensors, so RGB-D SOD can help them better perceive the environment and take appropriate actions.
VI-B Background Changing Application
Background changing techniques have become vital for art designers to leverage the increasing volumes of available image databases. Traditional designers use Photoshop to design their products. This is a time-consuming task that requires significant technical knowledge, and a large majority of potential users fail to master the advanced techniques of art design. Thus, an easy-to-use application is needed.
To overcome the above-mentioned drawbacks, salient object detection technology could be a potential solution. Previous similar works, such as the automatic generation of visual-textual applications [104, 105], motivate us to create a background changing application for book cover layouts. We provide a prototype demo, as shown in Fig. 11. First, the user uploads an image as a candidate design image ((a) Input Image). Then, content-based image features, such as an RGB-D based saliency map, are used to automatically extract the salient objects. Finally, the system allows the user to choose from our library of professionally designed book cover layouts ((b) Template). By combining high-level template constraints and low-level image features, we obtain the background-changed book cover ((d) Results).
Since designing a complete software system is not our main focus in this article, future researchers can follow Yang et al. [104] and set the visual background image according to a specified topic [105]. In stage two, the input image is resized to match the target style size while preserving the salient region according to the inference of our D3Net model.
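One simple way a saliency map could drive the background change is per-pixel blending of the input image with the template. The sketch below is an assumption about how such a step might look, not a description of the demo in Fig. 11: it uses single-channel images as nested lists, treats the saliency map as an alpha mask in [0, 1], and the function name `composite` is hypothetical.

```python
def composite(foreground, background, saliency):
    """Blend per pixel: saliency * foreground + (1 - saliency) * background.

    All inputs are 2-D lists of the same size. Pixels with saliency 1 keep
    the input image, pixels with saliency 0 take the template background.
    """
    return [
        [s * f + (1.0 - s) * b for f, b, s in zip(f_row, b_row, s_row)]
        for f_row, b_row, s_row in zip(foreground, background, saliency)
    ]


photo = [[10.0, 10.0]]      # toy 1x2 input image
template = [[0.0, 0.0]]     # toy 1x2 template background
mask = [[1.0, 0.0]]         # left pixel salient, right pixel background
cover = composite(photo, template, mask)
```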
VII Discussion
Based on our comprehensive benchmarking results, we present our conclusions on the most important questions, which may help the research community rethink the use of RGB-D images for salient object detection.
VII-A Ablation Study
We now provide a detailed analysis of the proposed baseline D3Net model. To verify the effectiveness of the depth map filter mechanism (the DDU), we conduct two ablation studies, w/o DDU and DDU, which refer to our D3Net without and with the DDU, respectively. For w/o DDU, we further test the performance of the three sub-networks in the test phase of D3Net. In Table V, we observe that RgbdNet performs better than RgbNet on the SIP, STERE, DES, LFSD, SSD, and NJU2K datasets. This indicates that cross-modality (RGB and depth) features show strong promise for RGB-D image representation learning. In most cases, however, DepthNet has lower performance than RgbdNet and RgbNet. This shows that it is difficult for a model relying on only a single modality to construct the geometric structure of an image.
From Table V, we also observe that the use of the DDU improves the performance (compared to RgbdNet) to a certain extent on the STERE, DES, NJU2K, and NLPR datasets. We attribute the improvement to the DDU being able to discard low-quality depth maps and select one optimal path (RgbNet or RgbdNet). For the SSD dataset, however, the DDU achieves performance comparable to the single-stream network (i.e., RgbdNet). It is worth mentioning that D3Net outperforms any prior approach intended for SOD without any post-processing techniques, such as CRF, which are typically used to boost scores. To find the lower and upper bounds of our D3Net, we additionally select the optimal path (RgbdNet or RgbNet) for each input. For example, for a specific RGB image and depth map, the two predicted maps, i.e., those of RgbNet and RgbdNet, can be assessed separately. Thus, for each input we know the best output available in the existing network. We aggregate all the best and all the worst results to obtain the upper bound and lower bound of our D3Net. From the results listed in Table V, D3Net still has a 1.6% performance gap on average relative to the upper bound.
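The upper and lower bounds described above amount to an oracle that picks, per input, the better or the worse of the two candidate predictions. The sketch below assumes MAE as the assessment metric (the text does not name one) and flat lists as maps; since lower MAE is better, the upper bound on performance corresponds to the smallest mean error.

```python
def mae(pred, gt):
    """Mean absolute error between two flat maps."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def oracle_bounds(samples):
    """Return (best_mean_mae, worst_mean_mae) over a list of samples.

    samples: list of (p_rgb, p_rgbd, gt) triples. For each sample the
    oracle keeps the lower error for the best case and the higher error
    for the worst case, then averages over all samples.
    """
    best, worst = [], []
    for p_rgb, p_rgbd, gt in samples:
        errors = (mae(p_rgb, gt), mae(p_rgbd, gt))
        best.append(min(errors))
        worst.append(max(errors))
    n = len(samples)
    return sum(best) / n, sum(worst) / n


# Toy data: RgbNet is right on the first input, RgbdNet on the second.
samples = [
    ([0.0, 1.0], [0.5, 0.5], [0, 1]),
    ([1.0, 1.0], [0.0, 1.0], [0, 1]),
]
best_case, worst_case = oracle_bounds(samples)
```

A practical gate such as the DDU has no access to the ground truth, so its score lies between these two values.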
VII-B Limitations
First, it is worth pointing out that the number of images in the SIP dataset is relatively small compared with most datasets for RGB salient object detection. Our goal in building this dataset is to explore the potential direction of smartphone-based applications. As can be seen from the benchmark results and the demo application described in Sec. VI, salient object detection over real human activity scenes is a promising direction. We plan to keep growing the dataset with more challenging situations and various kinds of foreground persons.
Second, our simple, general framework D3Net consists of three sub-networks, which may increase memory consumption on a light-weight device. In a real environment, several strategies can be considered to avoid this, such as replacing the backbone with MobileNet V2 [106], applying dimension reduction [107], or using the recently released ESPNet V2 [108] models. Third, we present the lower and upper bounds of the DDU. The optimal upper bound is obtained by feeding the input into whichever of RgbdNet or RgbNet yields the optimal predicted map. As shown in Table V, our DDU module does not reach the upper bound on the current training subset. There is thus still an opportunity to design a better DDU to further improve the performance.
VIII Conclusions
We present systematic studies on RGB-D based salient object detection by: (1) introducing a new human-oriented SIP dataset reflecting realistic in-the-wild mobile use scenarios; (2) designing a novel D3Net; and (3) conducting the largest-scale (97K) benchmark so far. Compared with existing datasets, SIP covers several challenges (e.g., background diversity, occlusion, etc.) of humans in real environments. Moreover, the proposed baseline achieves promising results. It is among the fastest methods, making it a practical solution to RGB-D salient object detection. The comprehensive benchmarking results include 32 summarized SOTAs and 18 evaluated traditional/deep models. We hope this benchmark will accelerate not only the development of this area but also others (e.g., stereo estimating/matching [109], multiple salient person detection, salient instance detection [19], sensitive object detection [110], image segmentation [111]). Note that the methods utilized in our D3Net baseline are simple, and more complex components (e.g., PDC in [112]) or training strategies [113] are promising ways to increase the performance. In the future, we plan to incorporate recently proposed techniques, e.g., the weighted triplet loss [114], hierarchical deep features [115], and visual question-driven saliency [116], into our D3Net to further boost the performance. Since this submission, many interesting models, such as UCNet [117], JL-DCF [118], GFNet [119], DMRA [120], ERNet [121], and BiANet [122], have been released. Please refer to our online leaderboard (http://dpfan.net/d3netbenchmark/) for more details. This website will be updated continually. We foresee this study driving salient object detection towards real-world application scenarios with multiple salient persons and complex interactions through mobile devices (e.g., smartphones or tablets).
**Acknowledgment.** We thank Jia-Xing Zhao, Yun Liu, and Qibin Hou for insightful feedback. This research was supported by Major Project for New Generation of AI under Grant No. 2018AAA0100400, NSFC (61922046), and Tianjin Natural Science Foundation (17JCJQJC43700).
- [1] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li, “Salient object detection: A survey,” Computational Visual Media, vol. 5, no. 2, pp. 117–150, 2019.
- [2] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji, “Detect globally, refine locally: A novel approach to saliency detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3127–3135.
- [3] H. Fu, D. Xu, S. Lin, and J. Liu, “Object-based rgbd image co-segmentation with mutex constraint,” in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 4428–4436.
- [4] P. Zhang, W. Liu, H. Lu, and C. Shen, “Salient object detection with lossless feature reflection and weighted structural loss,” IEEE T. Image Process., 2019.
- [5] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient Object Detection: A Benchmark,” IEEE T. Image Process., vol. 24, no. 12, pp. 5706–5722, 2015.
- [6] W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji, “Revisiting video saliency prediction in the deep learning era,” IEEE T. Pattern Anal. Mach. Intell., 2019.
- [7] D.-P. Fan, W. Wang, M.-M. Cheng, and J. Shen, “Shifting more attention to video salient object detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 8554–8564.
- [8] Y. Zeng, Y. Zhuge, H. Lu, and L. Zhang, “Multi-source weak supervision for saliency detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019.
