# Cross-Modal Weakly Supervised RGB-D Salient Object Detection with a Focus on Filamentary Structures

**Authors:** Yifan Ding, Weiwei Chen, Guomin Zhang, Zhaoming Feng, Xuan Li

PMC · DOI: 10.3390/s25102990 · Sensors (Basel, Switzerland) · 2025-05-09

## TL;DR

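This paper introduces a cross-modal weakly supervised framework for salient object detection (SOD) in RGB-D images. It combines point and text labels to generate pseudo-labels and targets objects with filamentary (thin, thread-like) structures, whose boundaries are hard to outline completely under weak supervision.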

## Contribution

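A cross-modal weakly supervised SOD framework that pairs a pseudo-label generation network (CPGN), driven by point and text labels, with an asymmetric salient-region prediction network (ASPN) that encodes RGB images with a Swin-Transformer and depth images with a CNN, aimed at better detection of filamentary objects.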

## Key findings

- The proposed framework outperforms other weakly supervised SOD methods in depicting salient objects, especially those with complex and filamentary structures.
- The use of cross-modal pseudo-labels and asymmetric networks improves boundary accuracy in RGB-D saliency detection.
- An edge constraint module enhances the sharpness of predicted salient regions.
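
To make the idea of an edge constraint concrete, the following is a minimal sketch of an edge-consistency loss on saliency maps. It is not the paper's ECM, whose design is not described in this summary; the functions `gradient_magnitude` and `edge_loss`, and the choice of forward differences with an L1 penalty, are assumptions made for illustration.

```python
def gradient_magnitude(sal):
    """Forward-difference edge strength |dx| + |dy| of a 2-D saliency map.

    `sal` is a list of rows with values in [0, 1]. Pixels on the last
    row or column have no forward neighbour, so that term is zero.
    """
    h, w = len(sal), len(sal[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = sal[y][x + 1] - sal[y][x] if x + 1 < w else 0.0
            gy = sal[y + 1][x] - sal[y][x] if y + 1 < h else 0.0
            out[y][x] = abs(gx) + abs(gy)
    return out


def edge_loss(pred, target):
    """Mean absolute difference between the edge maps of pred and target.

    Hypothetical loss: it is zero when the predicted edges match the
    target edges and grows as the predicted boundary becomes blurrier.
    """
    gp, gt = gradient_magnitude(pred), gradient_magnitude(target)
    n = len(pred) * len(pred[0])
    return sum(abs(a - b) for rp, rt in zip(gp, gt) for a, b in zip(rp, rt)) / n


# A one-row toy example: a sharp step versus a blurred step.
target = [[0, 0, 1, 1]]
blurry = [[0.0, 0.25, 0.75, 1.0]]
print(edge_loss(target, target))  # sharp prediction matches the target edges
print(edge_loss(blurry, target))  # blurred boundary is penalised
```

A penalty of this kind matters most for thin structures, where a boundary blurred by even one pixel can erase the object entirely.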

## Abstract

Current weakly supervised salient object detection (SOD) methods for RGB-D images mostly rely on image-level labels and sparse annotations, which makes it difficult to fully delineate object boundaries in complex scenes, especially for objects with filamentary structures. To address this, we propose a novel cross-modal weakly supervised SOD framework. The framework exploits cross-modal weak labels to generate high-quality pseudo-labels and couples the multi-scale features of RGB and depth images for precise saliency prediction. It consists mainly of a cross-modal pseudo-label generation network (CPGN) and an asymmetric salient-region prediction network (ASPN). The CPGN combines the precise pixel-level guidance of point labels with the enhanced semantic supervision of text labels to generate high-quality pseudo-labels, which then supervise the training of the ASPN. To better capture contextual information and geometric features, the ASPN, an asymmetric progressive network, gradually extracts multi-scale features from the RGB and depth images using Swin-Transformer and CNN encoders, respectively. This significantly enhances the model’s ability to perceive detailed structures. Additionally, an edge constraint module (ECM) is designed to sharpen the edges of the predicted salient regions. Experimental results show that the method depicts salient objects, especially filamentary structures, better than other weakly supervised SOD methods.
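
As an illustration of how sparse point labels can supervise a dense saliency map, here is a minimal sketch of a partial binary cross-entropy that is evaluated only at annotated pixels. This is a common choice in weakly supervised segmentation and is an assumption here: the summary does not say how the CPGN consumes point labels. The names `IGNORE`, `points_to_sparse_label`, and `partial_bce` are hypothetical.

```python
import math

IGNORE = -1  # marker for pixels that carry no annotation (assumed convention)


def points_to_sparse_label(height, width, fg_points, bg_points):
    """Build a label map from (row, col) clicks: 1 = salient, 0 = background."""
    label = [[IGNORE] * width for _ in range(height)]
    for y, x in fg_points:
        label[y][x] = 1
    for y, x in bg_points:
        label[y][x] = 0
    return label


def partial_bce(pred, label, eps=1e-7):
    """Binary cross-entropy averaged over annotated pixels only.

    `pred` holds probabilities in [0, 1]; unlabeled pixels (IGNORE)
    contribute nothing, so the network is free to decide them.
    """
    total, count = 0.0, 0
    for row_p, row_l in zip(pred, label):
        for p, l in zip(row_p, row_l):
            if l == IGNORE:
                continue
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -math.log(p) if l == 1 else -math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0


# Two clicks on a 3x3 image: one salient point, one background point.
label = points_to_sparse_label(3, 3, fg_points=[(1, 1)], bg_points=[(0, 0)])
pred = [[0.5] * 3 for _ in range(3)]
print(partial_bce(pred, label))
```

Because only two of nine pixels are annotated in this toy case, such a loss leaves most of the map unconstrained, which is the gap that denser pseudo-labels are meant to close.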

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12115235/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12115235/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/PMC12115235/full.md

---
Source: https://tomesphere.com/paper/PMC12115235