Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

Songsong Xiong; Georgios Tziafas; Hamidreza Kasaei

arXiv:2210.04613·cs.CV·March 7, 2023·1 cites

TL;DR

This paper introduces a hybrid Vision Transformer-CNN model for fine-grained 3D object recognition in robotics, utilizing synthetic datasets and demonstrating superior accuracy and practical robotic integration.

Contribution

It presents a novel hybrid multi-modal ViT-CNN approach and provides new synthetic datasets for fine-grained 3D object recognition in robotics.

Findings

1. Achieved over 94% accuracy on the restaurant dataset.
2. Outperformed CNN-only and ViT-only baselines.
3. Demonstrated effectiveness in real-world robotic scenarios.

Abstract

Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe…

Code & Models

Repositories

github-songsong/fine-grained-pointcloud-object-dataset (Official)

Taxonomy

Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Residual Connection · Label Smoothing · Adam