LiT Tuned Models for Efficient Species Detection
Andre Nakkab, Benjamin Feuer, Chinmay Hegde

TL;DR
This paper presents a new methodology for adapting fine-grained image datasets for vision-language pretraining, achieving state-of-the-art zero-shot species classification with a frozen vision model.
Contribution
It introduces locked-image text tuning (LiT), which enables effective transfer learning on species detection datasets using a pre-trained, frozen vision model.
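As a rough illustration of the idea, the sketch below computes a symmetric contrastive loss between image and text embeddings and then takes one update step that changes the text embeddings only, leaving the image embeddings "locked". This is not the authors' implementation: the function names, the temperature of 0.07, the toy 2-D embeddings, and the finite-difference gradient (a stand-in for automatic differentiation) are all assumptions made for the example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair.

    The temperature value is an illustrative assumption.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each image should pick its own caption.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each caption should pick its own image.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def text_only_step(frozen_image_emb, text_emb, lr=0.1, eps=1e-4):
    """One gradient step on the text embeddings; image embeddings stay fixed.

    Finite differences are used here only to keep the sketch dependency-free.
    """
    base = contrastive_loss(frozen_image_emb, text_emb)
    updated = [row[:] for row in text_emb]
    for i, row in enumerate(text_emb):
        for j in range(len(row)):
            bumped = [r[:] for r in text_emb]
            bumped[i][j] += eps
            grad = (contrastive_loss(frozen_image_emb, bumped) - base) / eps
            updated[i][j] = row[j] - lr * grad
    return updated


# Toy data: two images with fixed embeddings, two poorly aligned captions.
frozen_image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.6, 0.4], [0.4, 0.6]]

loss_before = contrastive_loss(frozen_image_emb, text_emb)
text_emb_after = text_only_step(frozen_image_emb, text_emb)
loss_after = contrastive_loss(frozen_image_emb, text_emb_after)
```

In this toy setup the loss drops after the step even though the image embeddings never change, which is the sense in which alignment is achieved on the language side alone.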
Findings
Achieved new state-of-the-art zero-shot classification accuracy on iNaturalist-2021.
Demonstrated that language alignment alone can produce strong transfer learning results.
Enabled utilization of high-quality vision-language models in species detection applications.
Abstract
Recent advances in training vision-language models have demonstrated unprecedented robustness and transfer learning effectiveness; however, standard computer vision datasets are image-only, and therefore not well adapted to such training methods. Our paper introduces a simple methodology for adapting any fine-grained image classification dataset for distributed vision-language pretraining. We implement this methodology on the challenging iNaturalist-2021 dataset, comprising approximately 2.7 million images of macro-organisms across 10,000 classes, and achieve a new state-of-the-art model in terms of zero-shot classification accuracy. Somewhat surprisingly, our model (trained using a new method called locked-image text tuning) uses a pre-trained, frozen vision representation, proving that language alignment alone can attain strong transfer learning performance, even on fractious,…
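One way to picture the dataset adaptation described in the abstract is to turn each class label into a caption and then classify an image by finding the caption whose embedding is closest to it. The sketch below is a minimal illustration under stated assumptions: the caption template, the function names, and the toy 2-D embeddings are hypothetical and are not taken from the paper.

```python
import math


def label_to_caption(scientific_name, common_name=None, template="a photo of {name}."):
    """Turn a class label into a caption (the template is a hypothetical choice)."""
    name = f"{common_name} ({scientific_name})" if common_name else scientific_name
    return template.format(name=name)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, caption_embs):
    """Return the index of the caption embedding most similar to the image."""
    scores = [cosine(image_emb, c) for c in caption_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Captions built from class labels alone; no per-image text is needed.
captions = [
    label_to_caption("Danaus plexippus", "monarch butterfly"),
    label_to_caption("Apis mellifera"),
]

# Toy embeddings standing in for the outputs of real text and image encoders.
caption_embs = [[1.0, 0.1], [0.1, 1.0]]
image_emb = [0.2, 0.9]
predicted = zero_shot_classify(image_emb, caption_embs)
```

In practice the caption embeddings would come from the text encoder and the image embedding from the frozen vision encoder; only the label-to-caption step depends on the dataset.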
Taxonomy
Topics: Genomics and Phylogenetic Studies · Microbial infections and disease research
