BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity
Chih-Hsuan Yang, Benjamin Feuer, Zaki Jubery, Zi K. Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh K Singh, Arti Singh, Soumik Sarkar, Nirav Merchant, Chinmay Hegde, and Baskar Ganapathysubramanian

TL;DR
BioTrove is the largest publicly accessible curated image dataset for biodiversity. It is intended to help AI models recognize and analyze species across three kingdoms (Animalia, Fungi, and Plantae), in support of ecological and agricultural applications.
Contribution
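The paper introduces BioTrove, a publicly accessible biodiversity image dataset in which each image is annotated with its scientific name, taxonomic hierarchy, and common name. It demonstrates the dataset's utility by training CLIP models on a 40-million-image subset (BioTrove-Train) and by establishing new benchmarks.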
Findings
BioTrove contains 161.9 million images spanning approximately 366.6K species.
Models trained on BioTrove-Train achieve improved zero-shot recognition accuracy.
New benchmarks for biodiversity image recognition are introduced.
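For readers unfamiliar with the term, zero-shot recognition with a CLIP-style model usually means comparing an image embedding against text embeddings of candidate labels and picking the closest one. The sketch below illustrates only that comparison step. It is not the authors' evaluation code: the three-dimensional vectors are made-up stand-ins for encoder outputs, and the function names `l2_normalize` and `zero_shot_predict` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_predict(image_embedding, label_embeddings):
    """Return (best_label, scores), scoring each label by cosine similarity."""
    img = l2_normalize(image_embedding)
    scores = {}
    for label, emb in label_embeddings.items():
        txt = l2_normalize(emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumption): a real model would produce these from
# an image encoder and a text encoder, in a much higher dimension.
label_embeddings = {
    "Danaus plexippus": [0.9, 0.1, 0.0],
    "Apis mellifera": [0.1, 0.9, 0.1],
    "Amanita muscaria": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best, scores = zero_shot_predict(image_embedding, label_embeddings)
print(best)
```

No label here is seen during a classification-specific training step; the prediction comes entirely from similarity in the shared embedding space, which is why the quality of the image-caption training data matters.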
Abstract
We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for…
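The abstract says each image carries a scientific name, a taxonomic hierarchy, and a common name, and that the CLIP models are trained on captioned images. One way such metadata could be turned into a text caption is sketched below. The record field names, the `build_caption` helper, and the caption template are all assumptions made for illustration; the abstract does not specify the caption format the authors use.

```python
# Ranks above species, listed from broadest to narrowest (hypothetical field names).
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def build_caption(record):
    """Compose a caption string from one image's taxonomic metadata."""
    lineage = " > ".join(record[r] for r in RANKS if record.get(r))
    parts = [f"a photo of {record['scientific_name']}"]
    if record.get("common_name"):
        parts.append(f"commonly known as {record['common_name']}")
    if lineage:
        parts.append(f"taxonomy: {lineage}")
    return ", ".join(parts)


# Illustrative record; the dictionary layout is an assumption, not the dataset schema.
record = {
    "scientific_name": "Danaus plexippus",
    "common_name": "monarch butterfly",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Lepidoptera",
    "family": "Nymphalidae",
    "genus": "Danaus",
}

caption = build_caption(record)
print(caption)
```

Missing fields are skipped rather than filled in, so a record with only a scientific name still yields a usable caption.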
Taxonomy
Topics: Species Distribution and Climate Change
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
