A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis
Leonardo Scabini, Andre Sacilotti, Kallil M. Zielinski, Lucas C. Ribas, Bernard De Baets, Odemir M. Bruno

TL;DR
This paper evaluates various Vision Transformers for texture recognition, demonstrating their superior performance over CNNs and hand-engineered models, especially with strong pre-training and in real-world scenarios.
Contribution
It provides a comprehensive comparison of 21 ViT variants for texture analysis, highlighting their potential and efficiency in this specific application.
Findings
ViTs outperform CNNs and hand-engineered models in texture recognition.
Stronger pre-training improves ViT performance on in-the-wild textures.
ViT-B with DINO pre-training, BeiTv2, and the Swin architecture are among the top-performing models.
Abstract
Texture, a significant visual attribute in images, has been extensively investigated across various image recognition applications. Convolutional Neural Networks (CNNs), which have been successful in many computer vision tasks, are currently among the best texture analysis approaches. On the other hand, Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition, causing a paradigm shift in the field. However, ViTs have so far not been scrutinized for texture recognition, hindering a proper appreciation of their potential in this specific setting. For this reason, this work explores various pre-trained ViT architectures when transferred to tasks that rely on textures. We review 21 different ViT variants and perform an extensive evaluation and comparison with CNNs and hand-engineered models on several tasks, such as assessing robustness to…
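The transfer setting described in the abstract (a pre-trained backbone used as a fixed feature extractor, with a simple classifier fitted on top) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' pipeline: the "backbone" is a hypothetical stand-in (a frozen random patch projection with average pooling) in place of a real pre-trained ViT, the classifier is a ridge-regularized linear model, and the two texture classes are synthetic noise images that differ only in contrast.

```python
import numpy as np

PATCH = 4  # patch side length; an illustrative choice, not taken from the paper


def extract_features(images, projection):
    """Hypothetical stand-in for a frozen pre-trained backbone.

    Splits each image into non-overlapping PATCH x PATCH patches, applies a
    fixed linear projection followed by ReLU to every patch, and average-pools
    the patch embeddings into one feature vector per image.
    """
    n, h, w = images.shape
    patches = images.reshape(n, h // PATCH, PATCH, w // PATCH, PATCH)
    patches = patches.transpose(0, 1, 3, 2, 4).reshape(n, -1, PATCH * PATCH)
    tokens = np.maximum(patches @ projection, 0.0)  # (n, num_patches, dim)
    return tokens.mean(axis=1)                      # (n, dim)


def fit_linear_classifier(features, labels, num_classes, reg=1e-3):
    """Ridge regression onto one-hot targets (an assumed, simple classifier)."""
    x = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    y = np.eye(num_classes)[labels]
    return np.linalg.solve(x.T @ x + reg * np.eye(x.shape[1]), x.T @ y)


def predict(features, weights):
    x = np.hstack([features, np.ones((len(features), 1))])
    return (x @ weights).argmax(axis=1)


def make_toy_textures(rng, n_per_class, size=32):
    """Two synthetic 'textures': low-contrast and high-contrast noise."""
    low = rng.normal(scale=0.1, size=(n_per_class, size, size))
    high = rng.normal(scale=1.0, size=(n_per_class, size, size))
    return np.concatenate([low, high]), np.repeat([0, 1], n_per_class)


def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(PATCH * PATCH, 8))  # frozen, never updated
    train_x, train_y = make_toy_textures(rng, 50)
    test_x, test_y = make_toy_textures(rng, 50)
    weights = fit_linear_classifier(
        extract_features(train_x, projection), train_y, num_classes=2
    )
    preds = predict(extract_features(test_x, projection), weights)
    return float((preds == test_y).mean())  # test accuracy


if __name__ == "__main__":
    print(f"toy test accuracy: {run_demo():.2f}")
```

Only the classifier weights are fitted; the extractor is never updated, which is the property that makes this kind of comparison across backbones inexpensive. To approximate the setting of the paper, `extract_features` would be replaced by a forward pass through a pre-trained ViT or CNN, and the toy images by a texture dataset.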
Taxonomy
Topics: Image Processing and 3D Reconstruction
Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels
