An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders
Scott C. Lowe, Joakim Bruslund Haurum, Sageev Oore, Thomas B. Moeslund, Graham W. Taylor

TL;DR
This study evaluates how well pretrained image models, especially self-supervised ones, can generalize to unseen datasets through clustering, revealing differences in feature representations and proposing silhouette score as a proxy for clustering quality.
Contribution
The paper benchmarks pretrained encoders on unseen datasets, highlights differences between supervised and self-supervised models, and proposes the silhouette score as a proxy for clustering performance.
Findings
Supervised encoders perform better within the training domain.
Self-supervised encoders perform better on datasets far outside the training domain.
Silhouette score in UMAP space correlates with clustering performance.
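The last finding can be illustrated with a minimal sketch, under several assumptions that are not from the paper: synthetic Gaussian blobs stand in for encoder embeddings, PCA is swapped in for UMAP to keep the example free of extra dependencies, K-Means is used as the conventional clustering algorithm, and adjusted mutual information (AMI) against the true labels is used as the measure of clustering performance. The function name `cluster_and_score` and its parameters are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score


def cluster_and_score(embeddings, labels, n_clusters, n_components=2, seed=0):
    """Reduce, cluster, and return (silhouette, AMI).

    Assumption: PCA is a stand-in for UMAP here; the silhouette score is
    computed in the reduced space, without using the true labels.
    """
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)
    sil = silhouette_score(reduced, pred)           # label-free quality proxy
    ami = adjusted_mutual_info_score(labels, pred)  # agreement with true labels
    return sil, ami


# Synthetic stand-in "embeddings": the same 5 class centres with increasing
# within-class spread, mimicking progressively less separable features.
results = []
for spread in (0.5, 2.0, 6.0):
    X, y = make_blobs(n_samples=600, n_features=32, centers=5,
                      cluster_std=spread, random_state=0)
    sil, ami = cluster_and_score(X, y, n_clusters=5)
    results.append((spread, sil, ami))
    print(f"spread={spread:>4}: silhouette={sil:.3f}  AMI={ami:.3f}")
```

The silhouette score needs no ground-truth labels, which is what makes it usable as a proxy on a dataset where annotations are unavailable. AMI is printed alongside it only so the two can be compared on data where the labels are known.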
Abstract
Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments uses encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features from supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice versa far outside of it; however, fine-tuned encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learned…
Taxonomy
Topics: Advanced Clustering Algorithms Research · Neural Networks and Applications
