Dissecting RGB-D Learning for Improved Multi-modal Fusion

Hao Chen; Haoran Zhou; Yunshu Zhang; Zheng Lin; Yongjian Deng

arXiv:2308.10019 · cs.CV · June 17, 2025

Open Access

TL;DR

This paper introduces an analytical framework to dissect RGB-D multi-modal learning, revealing feature discrepancies and cooperation rules, and proposes a simple fusion strategy that improves performance across tasks.

Contribution

The paper presents a novel dissection method for RGB-D models, providing insights into feature interactions and proposing an effective fusion strategy based on these insights.

Findings

01. Discrepancy in cross-modal features identified

02. Hybrid cooperation rule enhances inference

03. Proposed fusion strategy improves multiple tasks

Abstract

In the RGB-D vision community, extensive research has been focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain a black box. In this paper, we present an analytical framework and a novel score to dissect the RGB-D vision community. Our approach involves measuring proposed semantic variance and feature similarity across modalities and levels, conducting visual and quantitative analyses on multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, evolution rules within each modality, and the collaboration logic used when optimizing an RGB-D model. Our studies reveal or verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection