Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study
Alexander Hackett, Srikanth Thudumu, Ginny Fisher, Jason Fisher

TL;DR
This study evaluates how the pretraining objective of a frozen ViT-B/16 encoder affects representation quality in extreme low-data fine-grained classification, offering guidance for encoder selection in data-scarce domains.
Contribution
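A backbone-controlled comparison of four pretraining objectives (supervised classification, SigLIP2 contrastive learning, MAE masked reconstruction, and DINOv3 self-distillation) on a custom emerald inclusion grading dataset, evaluated with leave-one-out linear and nonlinear probes.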
Findings
Supervised and contrastive (SigLIP2) encoders give the strongest linear separability (logistic-regression AUC 0.768 and 0.735).
MAE representations benefit from nonlinear probes.
DINOv3 underperforms the other encoders.
Abstract
Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does the pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (1,000 permutations) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739…
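The evaluation protocol in the abstract (frozen features, leave-one-out probing, macro one-vs-rest AUC, permutation testing) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: encoder features are assumed to be already extracted into an array, the linear probe is a standardized logistic regression with hypothetical hyperparameters, macro one-vs-rest AUC is taken as the unweighted mean of per-class binary AUCs, and the permutation null is assumed to be built by shuffling labels and rerunning the entire leave-one-out pipeline (the same scheme as scikit-learn's `permutation_test_score`). The helper names `linear_probe`, `macro_ovr_auc`, `loocv_auc`, and `permutation_test` are hypothetical, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def linear_probe():
    # Hypothetical linear probe: standardize frozen features, then logistic regression.
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


def macro_ovr_auc(labels, scores, classes):
    # Macro one-vs-rest AUC: unweighted mean of each class's binary AUC vs. the rest.
    return float(np.mean([
        roc_auc_score((labels == c).astype(int), scores[:, i])
        for i, c in enumerate(classes)
    ]))


def loocv_auc(features, labels, make_probe=linear_probe):
    # Each sample's class probabilities come from a probe fit on the other n - 1 samples.
    # Columns of `scores` follow sorted class order, matching np.unique(labels).
    scores = cross_val_predict(
        make_probe(), features, labels, cv=LeaveOneOut(), method="predict_proba"
    )
    return macro_ovr_auc(labels, scores, np.unique(labels))


def permutation_test(features, labels, make_probe=linear_probe,
                     n_permutations=1000, seed=0):
    # Null distribution (assumed scheme): rerun the full LOOCV pipeline with
    # shuffled labels, which breaks any association between features and classes.
    rng = np.random.default_rng(seed)
    observed = loocv_auc(features, labels, make_probe)
    null = np.array([
        loocv_auc(features, rng.permutation(labels), make_probe)
        for _ in range(n_permutations)
    ])
    # Add-one correction keeps the p-value strictly positive.
    p_value = (np.sum(null >= observed) + 1) / (n_permutations + 1)
    return observed, float(p_value)


# Synthetic stand-in for frozen encoder embeddings: 3 classes x 10 images, 16 dims.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 10)
features = rng.normal(size=(30, 16)) + 1.5 * labels[:, None]
auc, p = permutation_test(features, labels, n_permutations=30)
print(f"macro OvR AUC = {auc:.3f}, permutation p = {p:.3f}")
```

A nonlinear probe can be compared by passing a different factory as `make_probe`, for example a pipeline ending in an RBF-kernel `SVC(probability=True)`; the abstract does not specify which nonlinear probe the study used. Because this null refits every leave-one-out fold, 1,000 permutations cost roughly 1,000 times a single LOOCV run, which stays cheap only because the encoders are frozen and N is small.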
