Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking Study
Hugo Markoff, Stefan Hein Bengtson, and Michael Ørsted

TL;DR
This study evaluates the effectiveness of Vision Transformer models combined with various clustering techniques for automatically grouping unlabeled animal images at the species level, demonstrating high accuracy and ecological relevance.
Contribution
It provides a comprehensive benchmarking framework for applying ViT models to ecological image clustering, including open-source tools and practical recommendations.
Findings
Near-perfect species-level clustering with DINOv3 embeddings and t-SNE.
Unsupervised methods achieve high performance without prior species knowledge.
Robust extraction of intra-specific variation such as age and sex differences.
Abstract
Manual labeling of animal images remains a significant bottleneck in ecological research, limiting the scale and efficiency of biodiversity monitoring efforts. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can reduce thousands of unlabeled animal images directly to species-level clusters. We present a comprehensive benchmarking framework evaluating five ViT models combined with five dimensionality reduction techniques and four clustering algorithms, two supervised and two unsupervised, across 60 species (30 mammals and 30 birds), with each test using a random subset of 200 validated images per species. We investigate when clustering succeeds at species-level, where it fails, and whether clustering within the species-level reveals ecologically meaningful patterns such as sex, age, or phenotypic variation. Our results demonstrate near-perfect…
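The pipeline the abstract describes (embeddings, then dimensionality reduction, then clustering, then comparison against validated labels) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: synthetic Gaussian blobs stand in for ViT embeddings, and KMeans and DBSCAN are illustrative choices that may not be among the algorithms the study benchmarks. It also assumes that "supervised" clustering means the number of species is given in advance, which the abstract does not spell out. All function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score


def make_fake_embeddings(n_species=5, per_species=60, dim=64, seed=0):
    """Synthetic stand-in for ViT embeddings (assumption: one blob per species)."""
    X, y = make_blobs(
        n_samples=n_species * per_species,
        centers=n_species,
        n_features=dim,
        cluster_std=1.0,
        random_state=seed,
    )
    return X, y


def reduce_2d(embeddings, seed=0):
    """Project high-dimensional embeddings to 2D with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=seed)
    return tsne.fit_transform(embeddings)


def cluster_known_k(points, n_clusters, seed=0):
    """Clustering when the number of species is supplied (illustrative: KMeans)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(points)


def cluster_unknown_k(points, eps=3.0, min_samples=5):
    """Clustering without a species count (illustrative: DBSCAN).

    eps is a guess for this toy data and would need tuning on real embeddings.
    Points labelled -1 are treated as noise by DBSCAN.
    """
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)


if __name__ == "__main__":
    X, y_true = make_fake_embeddings()
    pts = reduce_2d(X)

    known = cluster_known_k(pts, n_clusters=len(np.unique(y_true)))
    unknown = cluster_unknown_k(pts)

    # Adjusted Rand index compares predicted clusters to the validated labels.
    print("ARI, species count given:", adjusted_rand_score(y_true, known))
    print("ARI, species count unknown:", adjusted_rand_score(y_true, unknown))
    print("clusters found without a count:", len(set(unknown) - {-1}))
```

With real data, `make_fake_embeddings` would be replaced by embeddings extracted from a ViT model, and the adjusted Rand index is only one of several possible agreement scores.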
Taxonomy
Topics: Cell Image Analysis Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
