Unified Object Detector for Different Modalities based on Vision Transformers
Xiaoke Shen, Ioannis Stamos

TL;DR
This paper presents a unified object detection model based on vision transformers that switches between RGB and depth modalities without retraining, and reports performance comparable to or better than the state of the art on the SUN RGB-D dataset.
Contribution
The paper introduces a novel unified detection framework combining cross/inter-modality transfer learning with vision transformers, enabling modality switching without model updates.
Findings
Achieves performance comparable to or better than the state of the art on the SUN RGB-D dataset.
Introduces a novel inter-modality mixing method that improves results.
Demonstrates effective switching between modalities in robotics scenarios.
Abstract
Traditional systems typically require different models for processing different modalities, such as one model for RGB images and another for depth images. Recent research has demonstrated that a single model for one modality can be adapted for another using cross-modality transfer learning. In this paper, we extend this approach by combining cross/inter-modality transfer learning with a vision transformer to develop a unified detector that achieves superior performance across diverse modalities. Our research envisions an application scenario for robotics, where the unified system seamlessly switches between RGB cameras and depth sensors in varying lighting conditions. Importantly, the system requires no model architecture or weight updates to enable this smooth transition. Specifically, the system uses the depth sensor during low-lighting conditions (night time) and both the RGB camera…
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Industrial Vision Systems and Defect Detection
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Layer Normalization · Softmax · Multi-Head Attention · Residual Connection · Vision Transformer
