DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation
Yan Gong, Jianli Lu, Yongsheng Gao, Jie Zhao, Xiaojuan Zhang, and Susanto Rahardja

TL;DR
DiffPixelFormer introduces a novel differential pixel-aware Transformer that improves RGB-D indoor scene segmentation by enhancing intra-modal features and modeling inter-modal interactions for more accurate pixel-level alignment.
Contribution
The paper proposes the Intra-Inter Modal Interaction Block (IIMIB) and a dynamic fusion strategy to better model intra- and inter-modal relationships in RGB-D segmentation.
Findings
Achieves state-of-the-art mIoU scores on the SUN RGB-D and NYUDv2 datasets.
Outperforms previous methods such as DFormer-L by significant margins.
Demonstrates effective pixel-level cross-modal alignment and scene understanding.
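For readers unfamiliar with the mIoU metric cited in the findings, the following is a minimal sketch of how mean intersection-over-union is commonly computed from predicted and ground-truth labels. The function name `mean_iou`, its parameters, and the toy labels are illustrative assumptions, not taken from the paper; benchmark evaluation scripts typically accumulate these counts over an entire dataset and may treat absent classes differently.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, target: flat sequences of integer class labels of equal length.
    ignore_index: optional ground-truth label to skip (hypothetical parameter).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # A correct pixel counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Average only over classes that actually appear.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so their mean is 0.5.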
Abstract
Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the…
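The intra-modal half of the description above can be illustrated with generic scaled dot-product self-attention applied separately to each modality. This is only a sketch under stated assumptions: it uses a single head with identity query/key/value projections (a real Transformer learns these), the names `self_attention`, `rgb_tokens`, and `depth_tokens` are hypothetical, and it is not the IIMIB itself. The inter-modal step is deliberately left out because the abstract is cut off before describing it.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so queries, keys, and values
    are the input tokens themselves. tokens is a list of N vectors of size d.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        # Similarity of this token to every token in the same modality.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all tokens (long-range mixing).
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


# Hypothetical toy features: three pixel tokens per modality, two channels each.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth_tokens = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]

# Intra-modal attention: each modality attends only to its own tokens.
rgb_out = self_attention(rgb_tokens)
depth_out = self_attention(depth_tokens)
print(rgb_out)
print(depth_out)
```

Because every output token is a convex combination of the input tokens, each pixel can draw on information from any other pixel in the same modality, which is what "long-range dependencies" refers to here.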
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
