Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision Transformers for 3D Object Recognition
Georgios Tziafas, Hamidreza Kasaei

TL;DR
This paper introduces a simple transfer learning approach for RGB-D 3D object recognition using Vision Transformers, emphasizing late fusion techniques that outperform early fusion and achieve state-of-the-art accuracy.
Contribution
The paper proposes a transfer learning baseline for RGB-D ViT models built on late fusion, and shows that it outperforms both early fusion and unimodal methods.
Findings
Late fusion outperforms early fusion in RGB-D ViT models.
The approach achieves up to 95.4% top-1 accuracy on the ROD dataset.
It outperforms previous methods by over 8% in lifelong learning scenarios.
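To make the early versus late fusion distinction concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's architecture: `toy_encoder` is a hypothetical stand-in for a pretrained ViT (it mean-pools patch vectors and applies `tanh` instead of self-attention), and element-wise addition and averaging are assumed as the fusion operators. The names `early_fusion` and `late_fusion` are likewise illustrative.

```python
import math
from typing import Callable, List

Vector = List[float]
Patches = List[Vector]  # one embedding vector per image patch


def toy_encoder(patches: Patches) -> Vector:
    """Hypothetical stand-in for a pretrained ViT encoder.

    Mean-pools the patch vectors and applies tanh. A real ViT would
    run self-attention layers here instead.
    """
    dim = len(patches[0])
    pooled = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
    return [math.tanh(x) for x in pooled]


def early_fusion(rgb: Patches, depth: Patches,
                 encoder: Callable[[Patches], Vector]) -> Vector:
    """Fuse before encoding: add the modalities patch by patch,
    then run the encoder once on the fused patches."""
    fused = [[r + d for r, d in zip(rp, dp)] for rp, dp in zip(rgb, depth)]
    return encoder(fused)


def late_fusion(rgb: Patches, depth: Patches,
                encoder: Callable[[Patches], Vector]) -> Vector:
    """Fuse after encoding: run the same encoder on each modality
    separately, then average the two feature vectors."""
    f_rgb = encoder(rgb)
    f_depth = encoder(depth)
    return [(a + b) / 2 for a, b in zip(f_rgb, f_depth)]


if __name__ == "__main__":
    # Two patches per modality, two features per patch (made-up values).
    rgb = [[1.0, 0.0], [1.0, 0.0]]
    depth = [[0.0, 2.0], [0.0, 2.0]]
    print("early:", early_fusion(rgb, depth, toy_encoder))
    print("late: ", late_fusion(rgb, depth, toy_encoder))
```

The two functions differ only in where the modalities meet: early fusion costs one encoder pass over a mixed input, while late fusion costs two passes but lets each modality pass through the pretrained encoder unmodified before the features are combined.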
Abstract
The Vision Transformer (ViT) architecture has established its place in computer vision literature; however, training ViTs for RGB-D object recognition remains an understudied topic, viewed in recent literature only through the lens of multi-task pretraining in multiple vision modalities. Such approaches are often computationally intensive, relying on the scale of multiple pretraining datasets to align RGB with 3D information. In this work, we propose a simple yet strong recipe for transferring pretrained ViTs in RGB-D domains for 3D object recognition, focusing on fusing RGB and depth representations encoded jointly by the ViT. Compared to previous works in multimodal Transformers, the key challenge here is to use the attested flexibility of ViTs to capture cross-modal interactions at the downstream and not the pretraining stage. We explore which depth representation is better in terms…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Robotics and Sensor-Based Localization
Methods: Multi-Head Attention · Attention Is All You Need · ALIGN · Linear Layer · Label Smoothing · Adam · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout
