BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity
Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang

TL;DR
This paper introduces BIOSCAN-5M, a large multimodal insect dataset with images, DNA sequences, and metadata, and establishes benchmark tasks to advance machine learning applications in insect biodiversity monitoring.
Contribution
The paper presents BIOSCAN-5M, a comprehensive multimodal insect dataset with new benchmark tasks for classification, clustering, and cross-modal learning.
Findings
Pretraining on DNA barcodes improves species classification accuracy.
Zero-shot clustering effectively groups insects across modalities.
Contrastive learning creates shared embeddings for multi-modal data.
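To make the last finding concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss that pulls paired image and DNA embeddings together in one space. It is an illustration only, not the paper's implementation: the function names, the temperature of 0.1, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, dna_emb, temperature=0.1):
    """Hypothetical helper: row i of each input is assumed to be the same specimen.

    The temperature value is an assumption, not taken from the paper.
    """
    images = [l2_normalize(v) for v in image_emb]
    dnas = [l2_normalize(v) for v in dna_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(u, v)) / temperature for v in dnas]
        for u in images
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-DNA and DNA-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (made up): matched pairs should score a lower loss than mismatched ones.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(aligned, aligned)
loss_swapped = symmetric_contrastive_loss(aligned, swapped)
```

In this toy case the loss is close to zero when each image embedding matches its own DNA embedding and large when the pairs are swapped, which is the behavior a shared embedding space relies on.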
Abstract
As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical and size information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset, and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we…
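The masked-language-model pretraining mentioned in the abstract can be pictured with a small sketch of the input preparation step: split a barcode into tokens, hide some of them, and keep the hidden tokens as prediction targets. Everything specific here is an assumption for illustration, including the non-overlapping 4-mer tokenization, the 50% mask rate, the `[MASK]` token, and the made-up barcode string; the summary above does not state the paper's actual choices.

```python
import random


def kmer_tokenize(sequence, k=4):
    """Split a nucleotide string into non-overlapping k-mers.

    Trailing bases that do not fill a whole k-mer are dropped in this sketch.
    """
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]


def mask_tokens(tokens, mask_rate=0.5, mask_token="[MASK]", seed=0):
    """Hypothetical helper: replace a random subset of tokens with a mask token.

    Returns the masked token list and a dict mapping each masked position
    to the original token a model would be trained to recover.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = tokens[position]
        masked[position] = mask_token
    return masked, targets


barcode = "ACGTTGCAAACCGGTT"  # made-up sequence, not from the dataset
tokens = kmer_tokenize(barcode)
masked, targets = mask_tokens(tokens)
```

A model pretrained this way sees `masked` as input and is scored on how well it predicts the entries of `targets`; the learned encoder can then be reused for downstream classification.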
Taxonomy
Topics: Species Distribution and Climate Change · Insect and Arachnid Ecology and Behavior · Animal and Plant Science Education
Methods: Lib · Contrastive Learning
