Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models
Songsong Xiong, Georgios Tziafas, Hamidreza Kasaei

TL;DR
This paper introduces a hybrid Vision Transformer-CNN model for fine-grained 3D object recognition in robotics, utilizing synthetic datasets and demonstrating superior accuracy and practical robotic integration.
Contribution
It presents a novel hybrid multi-modal ViT-CNN approach and provides new synthetic datasets for fine-grained 3D object recognition in robotics.
Findings
Achieved over 94% accuracy on restaurant dataset
Outperformed CNN-only and ViT-only baselines
Demonstrated effectiveness in real-world robotic scenarios
Abstract
Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
MethodsMulti-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Residual Connection · Label Smoothing · Adam
