CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

ZeMing Gong; Austin T. Wang; Xiaoliang Huo; Joakim Bruslund Haurum; Scott C. Lowe; Graham W. Taylor; Angel X. Chang

arXiv:2405.17537·cs.AI·December 10, 2025·3 cites

TL;DR

This paper introduces CLIBD, a multimodal contrastive learning approach that aligns images, DNA barcodes, and text labels to improve biodiversity monitoring, enabling accurate classification of known and unknown insect species without fine-tuning.

Contribution

It is the first to fuse barcode DNA and image data using contrastive learning for biodiversity classification, surpassing single-modality methods in zero-shot accuracy.

Findings

01. Over 8% improvement in zero-shot classification accuracy
02. Effective alignment of images, DNA, and text in a unified embedding space
03. Enables classification of unknown insect species
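
As a rough illustration of how a unified embedding space can support classification without fine-tuning, the sketch below labels a query embedding by nearest-neighbor lookup under cosine similarity. The function names, toy vectors, and taxon labels are hypothetical, and this page does not spell out the retrieval procedure, so read it as one plausible interpretation rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def classify_by_retrieval(query, keys, key_labels):
    """Return the label and similarity of the key closest to the query.

    query      -- embedding of the sample to classify (e.g. from an image encoder)
    keys       -- reference embeddings (e.g. from a DNA barcode encoder)
    key_labels -- taxonomic label attached to each reference embedding
    """
    scores = [cosine(query, key) for key in keys]
    best = max(range(len(keys)), key=lambda i: scores[i])
    return key_labels[best], scores[best]


# Toy example with made-up 3-dimensional embeddings (assumption, not real data).
dna_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
dna_labels = ["taxon_A", "taxon_B"]
image_query = [0.9, 0.1, 0.0]
label, score = classify_by_retrieval(image_query, dna_keys, dna_labels)
```

Because the query and the keys may come from different encoders, the same lookup covers image-to-image, image-to-DNA, and DNA-to-DNA queries, provided the encoders were trained into a shared space.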

Abstract

Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space. This allows for accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse barcode DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks, showcasing its effectiveness in biodiversity studies.
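
The CLIP-style alignment described in the abstract can be sketched as a symmetric contrastive loss applied to each pair of modalities. The version below is a minimal pure-Python illustration under several assumptions that this page does not confirm: the three pairwise losses are summed with equal weight, the temperature is fixed at 0.07, and row i of each batch refers to the same specimen. All function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def pair_loss(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of paired embeddings."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    # Cosine similarities scaled by the temperature (assumed fixed here).
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def trimodal_loss(image, dna, text, temperature=0.07):
    """Equal-weight sum over the three modality pairs (an assumption)."""
    return (pair_loss(image, dna, temperature)
            + pair_loss(dna, text, temperature)
            + pair_loss(image, text, temperature))
```

When matching rows point in the same direction across modalities, the loss approaches zero; when one modality's rows are permuted, the pairs involving that modality are penalized heavily, which is the pressure that pulls the three encoders toward a common space.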

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 3 · Confidence 5

Strengths

1) The paper is well-written and organized. The proposed approach and the experiments are clearly described.
2) The proposed approach is effective in tackling the problem at hand, while being simpler than common alternatives in the field.

Weaknesses

1) The paper presents no methodological novelty, and mostly applies existing techniques in a standard way to a particular use case.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Incorporation of DNA as a modality to align the image embedding against instead of text is well motivated.
2. Extensive experiments and ablations are provided.

Weaknesses

1. The accuracy when doing image to DNA on unseen species is not quite significant although it is better than BioCLIP’s approach of doing image to text. This indicates the image encoder is still not strong enough to generate a good DNA aligned embedding just from the image. Perhaps this can improve with more data.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The idea of jointly embedding DNA barcodes with images and taxonomic information is interesting.
* The experiments in the paper are extensive - there is a lot of technical content, and it's clear that a lot of effort went into this work. There are many quantitative results, in addition to interesting qualitative results (e.g. Fig. 3, Fig. 5).
* The paper is very well-written.
* The hyperparameters and training procedures are clearly spelled out.

Weaknesses

My major issue with the paper is missing baselines:
* The paper compares their multimodal representation learning approach against the unimodal pretrained models they start with. They show that their method is better, and conclude that multimodality is important. However, this claim is not justified - couldn't the benefit be from the additional training each modality received? It seems to me that the fair comparison would be to take the unimodal models and run unimodal CLIP-style training (with

Code & Models

Repositories

Models

🤗 bioscan-ml/clibd · model · ♡ 1

Datasets

bioscan-ml/clibd · dataset · 215 dl

Videos

CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale · SlidesLive

Taxonomy

Topics: Species Distribution and Climate Change · Identification and Quantification in Food

Methods: ALIGN · Contrastive Learning