DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation

Yan Gong; Jianli Lu; Yongsheng Gao; Jie Zhao; Xiaojuan Zhang; and Susanto Rahardja

arXiv:2511.13047 · cs.CV · November 18, 2025

TL;DR

DiffPixelFormer introduces a novel differential pixel-aware Transformer that improves RGB-D indoor scene segmentation by enhancing intra-modal features and modeling inter-modal interactions for more accurate pixel-level alignment.

Contribution

It proposes the Intra-Inter Modal Interaction Block and a dynamic fusion strategy to better model intra- and inter-modal relationships in RGB-D segmentation.

Findings

01. Achieves state-of-the-art mIoU scores on the SUN RGB-D and NYUDv2 datasets.

02. Outperforms previous methods such as DFormer-L by significant margins.

03. Demonstrates effective pixel-level cross-modal alignment and scene understanding.
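The findings above are reported in mIoU (mean intersection over union), the standard segmentation metric. As a minimal sketch of how that number is typically computed, the function below averages per-class IoU over flattened label lists. The function name `mean_iou`, the `ignore_index` parameter, and the toy labels are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are flat lists of integer class labels, one per pixel.
    Pixels whose ground-truth label equals ignore_index are skipped.
    """
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch adds the pixel to the union of both classes.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (hypothetical labels): six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Benchmark implementations usually accumulate a confusion matrix over the whole dataset before averaging; this sketch scores a single prediction only.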

Abstract

Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the…
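To make the intra-modal part of the abstract concrete, here is a minimal sketch of generic scaled dot-product self-attention applied separately to RGB tokens and to depth tokens. It is not the authors' IIMIB: the learned query, key, and value projections are replaced by identity maps for brevity, the token values are made up, and the inter-modal mechanism is left out because the abstract is truncated before describing it.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention within one modality.

    tokens is a list of equal-length feature vectors. Q, K, and V are
    all the tokens themselves (identity projections, an assumption made
    only to keep the sketch short).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this query to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


# Hypothetical per-pixel features for each modality (3 tokens, 2 channels).
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth_tokens = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]

rgb_out = self_attention(rgb_tokens)      # intra-modal attention on RGB
depth_out = self_attention(depth_tokens)  # intra-modal attention on depth
```

Because every token attends to every other token in its modality, each output vector mixes information from the whole token set, which is what "long-range dependencies" refers to here.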

Taxonomy

Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Advanced Vision and Imaging