Efficient Multi-Task Scene Analysis with RGB-D Transformers
Söhnke Benedikt Fischedick, Daniel Seichter, Robin Schmidt, Leonard Rabes, and Horst-Michael Gross

TL;DR
This paper presents EMSAFormer, an efficient multi-task scene analysis method using RGB-D Transformers that achieves state-of-the-art results on indoor datasets while maintaining real-time inference on embedded hardware.
Contribution
It introduces a single RGB-D Transformer-based encoder for multi-task scene analysis, replacing the dual CNN-based encoder of the earlier EMSANet, and provides an optimized implementation for real-time robotic applications.
Findings
Achieves state-of-the-art performance on NYUv2, SUNRGB-D, and ScanNet datasets.
Enables real-time inference at 39.1 FPS on NVIDIA Jetson AGX Orin.
Demonstrates effective integration of RGB and depth data in a single Transformer encoder.
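To put the reported frame rate in context, the short sketch below converts it into a per-frame latency. Only the 39.1 FPS figure comes from the summary above. The 30 FPS camera rate and the helper name fps_to_latency_ms are illustrative assumptions, not values or code from the paper.

```python
def fps_to_latency_ms(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Frame rate reported in the findings above (NVIDIA Jetson AGX Orin).
reported_fps = 39.1
latency_ms = fps_to_latency_ms(reported_fps)

# Assumption for illustration only: an RGB-D camera streaming at 30 FPS.
camera_fps = 30.0
frame_period_ms = fps_to_latency_ms(camera_fps)

# Time left per camera frame after inference, under the assumption above.
headroom_ms = frame_period_ms - latency_ms

print(f"Inference latency: {latency_ms:.1f} ms per frame")
print(f"Camera frame period: {frame_period_ms:.1f} ms")
print(f"Headroom: {headroom_ms:.1f} ms")
```

Under these assumptions, one inference takes roughly 25.6 ms, which fits within the roughly 33.3 ms period of a 30 FPS stream.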
Abstract
Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given limited computing and battery capabilities on mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively…
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
