Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection
Hao Chen, Feihong Shen

TL;DR
This paper introduces a Hierarchical Cross-modal Transformer (HCT) for RGB-D salient object detection. HCT models long-range dependencies and cross-modal relationships through hierarchical attention, and it is reported to outperform existing CNN-based methods.
Contribution
The paper proposes a multi-modal transformer with hierarchical cross-modal attention, a Feature Pyramid module for Transformer (FPT), and a consistency-complementarity module for improved RGB-D salient object detection.
Findings
Significant performance improvement over state-of-the-art models.
Effective modeling of long-range and cross-modal dependencies.
Validated on multiple public datasets.
Abstract
Most existing RGB-D salient object detection (SOD) methods follow the CNN-based paradigm, which is unable to model long-range dependencies across space and modalities due to the natural locality of CNNs. Here we propose the Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to tackle this problem. Unlike previous multi-modal transformers that directly connect all patches from the two modalities, we explore the cross-modal complementarity hierarchically to respect the modality gap and spatial discrepancy in unaligned regions. Specifically, we propose to use intra-modal self-attention to explore complementary global contexts, and to measure spatially aligned inter-modal attention locally to capture cross-modal correlations. In addition, we present a Feature Pyramid module for Transformer (FPT) to boost informative cross-scale integration as well as a…
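The two attention levels described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the window `radius` parameter, and the use of identity query/key/value projections (no learned weights, a single head) are all assumptions made to keep the example short. It only contrasts global attention within one modality against cross-modal attention restricted to spatially aligned neighbourhoods.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_self_attention(tokens):
    """Intra-modal attention: every token attends to all tokens of the
    same modality. tokens has shape (N, d). Identity projections are an
    assumption of this sketch."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)          # (N, N)
    return softmax(scores) @ tokens

def aligned_cross_attention(rgb, depth, h, w, radius=1):
    """Inter-modal attention: each RGB token attends only to depth tokens
    inside a (2*radius+1)^2 window centred on its own grid position.
    rgb and depth have shape (h*w, d) in row-major order. The window
    size is a hypothetical choice, not taken from the paper."""
    d = rgb.shape[-1]
    scores = rgb @ depth.T / np.sqrt(d)              # (N, N)
    ys, xs = np.divmod(np.arange(h * w), w)          # grid row/column per token
    near = (np.abs(ys[:, None] - ys[None, :]) <= radius) & \
           (np.abs(xs[:, None] - xs[None, :]) <= radius)
    scores = np.where(near, scores, -np.inf)         # mask unaligned positions
    return softmax(scores) @ depth

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, d = 4, 4, 8
    rgb = rng.normal(size=(h * w, d))
    depth = rng.normal(size=(h * w, d))
    # Level 1: global context within each modality.
    rgb_ctx = global_self_attention(rgb)
    depth_ctx = global_self_attention(depth)
    # Level 2: local, spatially aligned cross-modal fusion.
    fused = aligned_cross_attention(rgb_ctx, depth_ctx, h, w, radius=1)
    print(fused.shape)
```

With `radius=0`, each RGB token attends only to the depth token at the same position, which is the strictest form of spatial alignment; larger radii tolerate small misalignment between the two modalities at a higher cost.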
Taxonomy
Topics: Visual Attention and Saliency Detection · Face Recognition and Perception · Gaze Tracking and Assistive Technology
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Adam · Position-Wise Feed-Forward Layer · Softmax
