CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale
ZeMing Gong, Austin T. Wang, Xiaoliang Huo, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang

TL;DR
This paper introduces CLIBD, a multimodal contrastive learning approach that aligns images, DNA barcodes, and text labels to improve biodiversity monitoring, enabling accurate classification of known and unknown insect species without fine-tuning.
Contribution
It is the first to fuse barcode DNA and image data using contrastive learning for biodiversity classification, surpassing single-modality methods in zero-shot accuracy.
Findings
Over 8% improvement in zero-shot classification accuracy over previous single-modality approaches
Effective alignment of images, DNA, and text in a unified embedding space
Enables classification of unknown insect species
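One common way to classify with a shared embedding space, including for species that were not seen in training, is nearest-neighbor retrieval: embed a query (for example an image) and assign it the label of the most similar key (for example a labelled DNA barcode). The sketch below illustrates that idea only. The function name `classify_by_nearest_key`, the toy vectors, and the label strings are hypothetical and are not taken from the CLIBD code; real embeddings would come from the trained encoders.

```python
import numpy as np

def classify_by_nearest_key(queries, keys, key_labels):
    """Assign each query the label of its most cosine-similar key.

    queries: (n_queries, dim) array, e.g. image embeddings (assumed input).
    keys: (n_keys, dim) array, e.g. DNA barcode embeddings (assumed input).
    key_labels: list of n_keys taxonomic labels, one per key.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = q @ k.T                    # (n_queries, n_keys) similarity matrix
    nearest = sims.argmax(axis=1)     # index of the closest key per query
    return [key_labels[i] for i in nearest]

# Toy example with 2-D "embeddings" (illustrative values only).
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = ["species_a", "species_b"]
queries = np.array([[0.9, 0.1], [0.2, 0.8]])
predictions = classify_by_nearest_key(queries, keys, labels)
```

Because no classifier head is trained, adding a new species only requires adding its labelled key embedding to `keys`.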
Abstract
Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space. This allows for accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse barcode DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks, showcasing its effectiveness in biodiversity studies.
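To make the CLIP-style objective concrete, the following is a minimal NumPy sketch of a symmetric contrastive (InfoNCE) loss summed over the three modality pairs. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the equal weighting of the three pairs are all assumptions, and the inputs stand in for encoder outputs where row i of each array describes the same specimen.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def pair_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature            # (batch, batch) similarities
    idx = np.arange(len(a))                   # matching pairs lie on the diagonal
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def tri_modal_loss(image, dna, text, temperature=0.07):
    # Assumption: the three pairwise losses are weighted equally.
    return (pair_loss(image, dna, temperature)
            + pair_loss(image, text, temperature)
            + pair_loss(dna, text, temperature))
```

Minimising this loss pulls the image, DNA, and text embeddings of the same specimen together and pushes apart embeddings of different specimens in the batch, which is what allows cross-modal lookup afterwards.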
Peer Reviews
Decision: ICLR 2025 Poster
1) The paper is well-written and organized. The proposed approach and the experiments are clearly described. 2) The proposed approach is effective in tackling the problem at hand, while being simpler than common alternatives in the field.
1) The paper presents no methodological novelty, and mostly applies existing techniques in a standard way to a particular use case.
1. Incorporation of DNA as a modality to align the image embedding against instead of text is well motivated. 2. Extensive experiments and ablations are provided.
1. The accuracy when doing image to DNA on unseen species is not quite significant although it is better than BioCLIP’s approach of doing image to text. This indicates the image encoder is still not strong enough to generate a good DNA aligned embedding just from the image. Perhaps this can improve with more data.
* The idea of jointly embedding DNA barcodes with images and taxonomic information is interesting.
* The experiments in the paper are extensive - there is a lot of technical content, and it's clear that a lot of effort went into this work. There are many quantitative results, in addition to interesting qualitative results (e.g. Fig. 3, Fig. 5).
* The paper is very well-written.
* The hyperparameters and training procedures are clearly spelled out.
My major issue with the paper is missing baselines: * The paper compares their multimodal representation learning approach against the unimodal pretrained models they start with. They show that their method is better, and conclude that multimodality is important. However, this claim is not justified - couldn't the benefit be from the additional training each modality received? It seems to me that the fair comparison would be to take the unimodal models and run unimodal CLIP-style training (with
Taxonomy
Topics: Species Distribution and Climate Change · Identification and Quantification in Food
Methods: ALIGN · Contrastive Learning
