A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis

Leonardo Scabini; Andre Sacilotti; Kallil M. Zielinski; Lucas C. Ribas; Bernard De Baets; Odemir M. Bruno

arXiv:2406.06136 · cs.CV · June 11, 2024 · 3 cites

TL;DR

This paper evaluates various Vision Transformers for texture recognition, demonstrating their superior performance over CNNs and hand-engineered models, especially with stronger pre-training and on in-the-wild textures.

Contribution

It provides a comprehensive comparison of 21 ViT variants for texture analysis, highlighting their potential and efficiency in this specific application.

Findings

01. ViTs outperform CNNs and hand-engineered models in texture recognition.

02. Stronger pre-training improves ViT performance on in-the-wild textures.

03. ViT-B with DINO, BeiTv2, and Swin are among the top models.

Abstract

Texture, a significant visual attribute in images, has been extensively investigated across various image recognition applications. Convolutional Neural Networks (CNNs), which have been successful in many computer vision tasks, are currently among the best texture analysis approaches. On the other hand, Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition, causing a paradigm shift in the field. However, ViTs have so far not been scrutinized for texture recognition, hindering a proper appreciation of their potential in this specific setting. For this reason, this work explores various pre-trained ViT architectures when transferred to tasks that rely on textures. We review 21 different ViT variants and perform an extensive evaluation and comparison with CNNs and hand-engineered models on several tasks, such as assessing robustness to…

Taxonomy

Topics: Image Processing and 3D Reconstruction

Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels