Learning Effective RGB-D Representations for Scene Recognition
Xinhang Song, Shuqiang Jiang, Luis Herranz, Chengpeng Chen

TL;DR
This paper introduces a new RGB-D scene recognition approach that learns depth-specific features directly, utilizes RGB-D videos for better depth coverage, and achieves state-of-the-art results on multiple datasets.
Contribution
It proposes a novel architecture and training method for depth feature learning, and introduces the ISIA RGB-D video dataset for scene recognition.
Findings
Achieves state-of-the-art results on the NYUD2 and SUN RGB-D datasets.
Demonstrates improved scene recognition using RGB-D videos.
Addresses depth data limitations with a new training approach.
Abstract
Deep convolutional networks (CNNs) can achieve impressive results on RGB scene recognition thanks to large datasets such as Places. In contrast, RGB-D scene recognition is still underdeveloped, due to two limitations of RGB-D data that we address in this paper. The first limitation is the lack of depth data for training deep learning models. Rather than fine-tuning or transferring RGB-specific features, we address this limitation by proposing an architecture and a two-step training approach that directly learns effective depth-specific features using weak supervision via patches. The resulting RGB-D model also benefits from more complementary multimodal features. The second limitation is the short range of depth sensors (typically 0.5m to 5.5m), which means depth images fail to capture distant objects in the scene that RGB images can. We show that this limitation can be addressed…
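
To make "weak supervision via patches" concrete, here is a minimal sketch of one way it could work, assuming each patch cut from a depth image simply inherits that image's scene label. This is an illustration only, not the authors' implementation: the function names `sample_labeled_patches` and `valid_fraction`, the patch size, and the stride are all hypothetical. The 0.5m to 5.5m sensor range is the one quoted in the abstract.

```python
def sample_labeled_patches(depth, scene_label, patch_size=4, stride=2):
    """Cut a 2D depth map (list of rows, values in metres) into square patches.

    Assumption for illustration: every patch inherits the image-level scene
    label, which is what makes the supervision "weak".
    """
    height, width = len(depth), len(depth[0])
    samples = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            rows = depth[top:top + patch_size]
            patch = [row[left:left + patch_size] for row in rows]
            samples.append((patch, scene_label))
    return samples


def valid_fraction(patch, near=0.5, far=5.5):
    """Fraction of pixels in a patch that lie inside the sensor's range."""
    values = [v for row in patch for v in row]
    return sum(near <= v <= far for v in values) / len(values)


# Toy 8x8 depth map at a constant 1.0 m, labelled with a hypothetical class.
depth_map = [[1.0] * 8 for _ in range(8)]
samples = sample_labeled_patches(depth_map, "bedroom")
```

A filter such as `valid_fraction` could be used to discard patches dominated by out-of-range readings before training, though whether the paper does anything similar is not stated in the abstract.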
