Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection

Hao Chen; Feihong Shen

arXiv:2302.08052 · cs.CV · February 17, 2023 · 1 citation

Open Access

TL;DR

This paper introduces the Hierarchical Cross-modal Transformer (HCT) for RGB-D salient object detection. HCT models long-range dependencies within each modality through global self-attention and captures cross-modal relationships through spatially aligned local attention, and it is reported to outperform existing CNN-based methods.

Contribution

The paper proposes a novel multi-modal transformer with hierarchical cross-modal attention, a feature pyramid module, and a consistency-complementarity module for improved RGB-D salient object detection.

Findings

01 Significant performance improvement over state-of-the-art models.

02 Effective modeling of long-range and cross-modal dependencies.

03 Validated on multiple public datasets.

Abstract

Most existing RGB-D salient object detection (SOD) methods follow the CNN-based paradigm, which cannot model long-range dependencies across space and modalities because of the inherent locality of CNNs. Here we propose the Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to tackle this problem. Unlike previous multi-modal transformers that directly connect all patches from the two modalities, we explore cross-modal complementarity hierarchically to respect the modality gap and the spatial discrepancy in unaligned regions. Specifically, we propose to use intra-modal self-attention to explore complementary global contexts, and to measure spatially aligned inter-modal attention locally to capture cross-modal correlations. In addition, we present a Feature Pyramid module for Transformer (FPT) to boost informative cross-scale integration as well as a…
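The two attention levels described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "local" inter-modal attention means each spatial position attends only to the RGB and depth tokens at that same position, and it omits learned query/key/value projections, multiple heads, and the FPT module. The function names (`attend`, `intra_modal_global`, `inter_modal_local`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a set of keys/values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def intra_modal_global(tokens):
    """Global step: every token attends to all tokens of the same modality."""
    return [attend(t, tokens, tokens) for t in tokens]


def inter_modal_local(rgb, depth):
    """Local step (assumed form): each position attends only to the RGB and
    depth tokens at that same spatial position, never to other positions."""
    out_rgb, out_depth = [], []
    for r, d in zip(rgb, depth):
        pair = [r, d]
        out_rgb.append(attend(r, pair, pair))
        out_depth.append(attend(d, pair, pair))
    return out_rgb, out_depth


# Toy example: three spatial positions, two-dimensional tokens per modality.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth_tokens = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]

# Hierarchy: global context within each modality, then local cross-modal fusion.
rgb_ctx = intra_modal_global(rgb_tokens)
depth_ctx = intra_modal_global(depth_tokens)
fused_rgb, fused_depth = inter_modal_local(rgb_ctx, depth_ctx)
```

Restricting the cross-modal step to aligned positions keeps its cost linear in the number of tokens, whereas connecting all patches from both modalities would make it quadratic in the combined token count.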

Taxonomy

Topics: Visual Attention and Saliency Detection · Face Recognition and Perception · Gaze Tracking and Assistive Technology

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Adam · Position-Wise Feed-Forward Layer · Softmax