Hands-on Evaluation of Visual Transformers for Object Recognition and Detection

Dimitrios N. Vlachogiannis; Dimitrios A. Koutsomitropoulos

arXiv:2512.09579 · cs.CV · December 11, 2025

Open Access

TL;DR

This paper evaluates various Vision Transformers (ViTs) for object recognition, detection, and medical imaging, demonstrating their competitive performance and advantages over traditional CNNs in understanding global image context.

Contribution

It provides a comprehensive comparison of pure, hierarchical, and hybrid ViTs against CNNs across multiple tasks and datasets, highlighting the effectiveness of hierarchical and hybrid models such as Swin and CvT.

Findings

01. Hybrid and hierarchical ViTs outperform CNNs in accuracy and efficiency.

02. Data augmentation significantly improves medical image classification performance.

03. Swin Transformer achieves a strong balance between accuracy and computational cost.

Abstract

Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between…
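The abstract's contrast between local convolutional patterns and global self-attention can be illustrated with a minimal sketch. It is not the authors' code: the patch values are made up, the function names are hypothetical, and the query, key, and value projections are assumed to be the identity to keep the example short. It shows only that every patch token receives a nonzero attention weight for every other patch, so each output mixes information from the whole image.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: Q, K, and V are the tokens themselves (identity projections);
    a real ViT learns separate projection matrices for each.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    weights = []
    for q in tokens:
        # Score this patch against every patch in the image, not just neighbours.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all value tokens.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs

# Toy 2x2 grid of patches, each a 3-dimensional embedding (made-up values).
patches = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.5, 0.0],
    [1.0, 1.0, 1.0],
]
weights, outputs = self_attention(patches)
```

Here `weights` is a dense 4x4 matrix whose rows each sum to 1. A convolution with a small kernel would instead combine only spatially adjacent positions in a single layer, which is the local focus the abstract attributes to CNNs.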

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI