Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer

Minh Bui; Kostas Alexis

arXiv:2409.15117·cs.CV·October 13, 2025

TL;DR

This paper introduces a diffusion-based framework for RGB-D semantic segmentation that uses a Deformable Attention Transformer encoder to extract features from depth images. It reports state-of-the-art results on NYUv2 and SUN-RGBD, with improved robustness and reduced training time.

Contribution

The paper combines a diffusion-based generative framework with a Deformable Attention Transformer encoder to improve RGB-D semantic segmentation.

Findings

01  State-of-the-art accuracy on the NYUv2 and SUN-RGBD datasets.

02  Robust performance in challenging scenarios, with less training time.

03  Effective modeling of RGB-D image distributions.

Abstract

Vision-based perception and reasoning is essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative…

Taxonomy

Topics: Industrial Vision Systems and Defect Detection

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections