Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision Transformers for 3D Object Recognition

Georgios Tziafas; Hamidreza Kasaei

arXiv:2210.00843 · cs.CV · March 8, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a simple transfer learning approach for RGB-D 3D object recognition using Vision Transformers, emphasizing late fusion techniques that outperform early fusion and achieve state-of-the-art accuracy.

Contribution

The paper proposes a novel transfer baseline for RGB-D ViT models built on late fusion, and shows that it outperforms both early fusion and unimodal methods.

Findings

1. Late fusion outperforms early fusion in RGB-D ViT models.
2. The approach achieves up to 95.4% top-1 accuracy on the ROD dataset.
3. It outperforms previous methods by over 8% in lifelong learning scenarios.

Abstract

The Vision Transformer (ViT) architecture has established its place in computer vision literature; however, training ViTs for RGB-D object recognition remains an understudied topic, viewed in recent literature only through the lens of multi-task pretraining in multiple vision modalities. Such approaches are often computationally intensive, relying on the scale of multiple pretraining datasets to align RGB with 3D information. In this work, we propose a simple yet strong recipe for transferring pretrained ViTs in RGB-D domains for 3D object recognition, focusing on fusing RGB and depth representations encoded jointly by the ViT. Compared to previous works in multimodal Transformers, the key challenge here is to use the attested flexibility of ViTs to capture cross-modal interactions at the downstream and not the pretraining stage. We explore which depth representation is better in terms…
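
To make the early versus late fusion distinction in the title concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the one-layer attention encoder, the random weights standing in for pretrained ones, the choice of summation for early fusion and concatenation for late fusion, and all names (`embed`, `encode`, `early_fusion`, `late_fusion`) are hypothetical. It also assumes the depth map has been rendered as a 3-channel image so that it shares the RGB patch shape.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_PATCHES, PATCH_DIM, EMBED_DIM = 16, 48, 32

# Hypothetical stand-ins for pretrained weights, shared by both modalities.
W_embed = rng.normal(scale=0.1, size=(PATCH_DIM, EMBED_DIM))
W_qkv = rng.normal(scale=0.1, size=(3, EMBED_DIM, EMBED_DIM))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def embed(patches):
    """Linear patch embedding: (num_patches, patch_dim) -> (num_patches, embed_dim)."""
    return patches @ W_embed


def encode(tokens):
    """Toy encoder: one residual self-attention layer, then mean pooling."""
    q, k, v = (tokens @ W for W in W_qkv)
    attn = softmax(q @ k.T / np.sqrt(EMBED_DIM))
    return (tokens + attn @ v).mean(axis=0)


def early_fusion(rgb_patches, depth_patches):
    # Fuse BEFORE the encoder: sum the patch embeddings, then encode once.
    return encode(embed(rgb_patches) + embed(depth_patches))


def late_fusion(rgb_patches, depth_patches):
    # Fuse AFTER the encoder: encode each modality separately with the
    # same weights, then concatenate the two pooled representations.
    return np.concatenate([encode(embed(rgb_patches)),
                           encode(embed(depth_patches))])


# Random arrays standing in for flattened image patches.
rgb = rng.random((NUM_PATCHES, PATCH_DIM))
depth = rng.random((NUM_PATCHES, PATCH_DIM))

early = early_fusion(rgb, depth)  # shape (32,)
late = late_fusion(rgb, depth)    # shape (64,)
print(early.shape, late.shape)
```

In this sketch, late fusion costs two encoder passes and doubles the feature size fed to a classifier head, while early fusion needs one pass but mixes the modalities before the encoder sees them. Which trade-off works better in practice is the question the paper studies.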

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Robotics and Sensor-Based Localization

Methods: Multi-Head Attention · Attention Is All You Need · ALIGN · Linear Layer · Label Smoothing · Adam · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout