Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer
Minh Bui, Kostas Alexis

TL;DR
This paper introduces a diffusion-based RGB-D semantic segmentation framework that uses a Deformable Attention Transformer as its depth encoder, reporting state-of-the-art results on NYUv2 and SUN-RGBD with improved robustness and reduced training time.
Contribution
It presents a novel diffusion-based approach combined with a deformable attention transformer for improved RGB-D semantic segmentation performance.
Findings
State-of-the-art accuracy on NYUv2 and SUN-RGBD datasets.
Robust performance in challenging scenarios with less training time.
Effective modeling of RGB-D image distributions.
Abstract
Vision-based perception and reasoning is essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative…
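As a rough illustration of what "diffusion-based" segmentation typically involves, the sketch below applies the standard DDPM forward noising step to an encoded label map. It is not the authors' implementation: the abstract does not specify the noise schedule, the label encoding, or the number of steps, so the linear schedule, the {-1, +1} one-hot encoding, and the names `linear_betas`, `cumulative_alphas`, `encode_labels`, and `q_sample` are all assumptions made for illustration.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (assumed; the paper's schedule is not stated here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def encode_labels(labels, num_classes):
    """Map each pixel label to a vector with +1 at its class and -1 elsewhere.

    This encoding is a hypothetical choice for the sketch.
    """
    return [[1.0 if c == y else -1.0 for c in range(num_classes)] for y in labels]


def q_sample(x0, t, alpha_bars, rng):
    """Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bars[t]
    signal, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    x_t = [
        [signal * v + sigma * e for v, e in zip(row, eps)]
        for row, eps in zip(x0, noise)
    ]
    return x_t, noise


# Toy example: six "pixels" with four classes, noised to step 500 of 1000.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_betas(1000))
labels = [0, 2, 1, 3, 3, 0]
x0 = encode_labels(labels, num_classes=4)
x_t, noise = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

In a setup like this, a denoising network conditioned on RGB and depth features would be trained to recover `noise` (or `x0`) from `x_t`, and segmentation at inference time would start from pure noise and denoise iteratively. How the paper conditions on the two modalities is not described in the text above.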
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections
