BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Selene Cerna; Sara Si-Moussi; Wilfried Thuiller; Hadrien Hendrikx; Vincent Miele

arXiv:2511.21194·cs.CV·March 10, 2026

Open Access · 3 Reviews

TL;DR

BotaCLIP is a lightweight contrastive learning framework that adapts pre-trained Earth Observation models to incorporate botanical knowledge, improving ecological predictions without extensive retraining.

Contribution

It introduces a novel domain-specific adaptation method for foundation models that enhances ecological data analysis through contrastive learning with regularization.

Findings

01. Improved plant presence prediction accuracy

02. Enhanced butterfly occurrence modeling

03. Better soil trophic group estimation

Abstract

Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the…
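The adaptation idea described in the abstract (contrastive alignment of imagery with botanical data, plus a regularizer against catastrophic forgetting) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the symmetric InfoNCE form, the Gram-matrix similarity penalty, the temperature of 0.07, the weight `lam`, and all function names are assumptions made here for illustration, since the abstract does not specify the exact loss.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Row-wise log-softmax, shifted by the row max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def clip_loss(img, bot, temperature=0.07):
    # Symmetric InfoNCE: row i of `img` should match row i of `bot`.
    img, bot = l2_normalize(img), l2_normalize(bot)
    logits = img @ bot.T / temperature
    idx = np.arange(len(img))
    loss_img_to_bot = -log_softmax(logits)[idx, idx].mean()
    loss_bot_to_img = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_img_to_bot + loss_bot_to_img)

def similarity_preservation(adapted, frozen):
    # Assumed regularizer: penalize drift of the batch's pairwise cosine
    # similarities away from those of the frozen pre-trained embeddings.
    # The Gram matrices are batch x batch, so the two embedding
    # dimensions do not have to match.
    a, f = l2_normalize(adapted), l2_normalize(frozen)
    return np.mean((a @ a.T - f @ f.T) ** 2)

def total_loss(adapted_img, bot, frozen_img, lam=1.0):
    # `lam` (hypothetical) trades alignment against forgetting.
    return clip_loss(adapted_img, bot) + lam * similarity_preservation(
        adapted_img, frozen_img
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frozen = rng.normal(size=(8, 16))   # stand-in for frozen image embeddings
    adapted = frozen + 0.1 * rng.normal(size=(8, 16))
    botany = rng.normal(size=(8, 16))   # stand-in for botanical embeddings
    print(total_loss(adapted, botany, frozen))
```

With `lam = 0` this reduces to plain CLIP-style alignment; larger values keep the adapted embedding geometry closer to the pre-trained one, which is the trade-off the regularization strategy is meant to control.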

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Adapting foundation models to specialized domains (e.g., in data-scarce fields like Ecology) without expensive retraining is highly relevant. The paper clearly articulates why generic EO embeddings are insufficient for fine-grained biodiversity modeling.
2. This paper evaluates across three ecologically diverse tasks spanning different taxonomic groups (plants, insects, soil) and prediction types (classification, habitat suitability, abundance regression), which demonstrates meaningful transf

Weaknesses

1. The approach essentially applies CLIP-style contrastive learning with a similarity-preserving regularization term. It represents an incremental contribution rather than a methodological improvement. Additionally, the authors acknowledge that DINOv3 introduces a closely related "Gram anchoring loss," which raises questions about the novelty of the contribution.
2. The +1.8% improvement in Spearman's ρ for soil monitoring (Table 1) might look significant but is small in practice. While th

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- Interesting application domain
- A model architecture suited for the task
- The paper is generally well written; I could follow most of the parts easily (except for the ablation study)

Weaknesses

- Lack of novelty: My main concern is a lack of novelty. The paper presents decent work, but it is based on well-known techniques. The application is an interesting one, but overall, the contribution is not sufficient to meet the very high standards of ICLR in my opinion.
- Results: The experimental comparison is only based on DOFA embeddings and BotaSP, and the improvement over these baselines is relatively small (e.g., for Plant, the BWiAuSclR-F1 scores are close to the simple BotaSP baseline)

Reviewer 03 · Rating 0 · Confidence 4

Strengths

1. The paper is easy to understand and proposes a simple contrastive technique that adapts DOFA to be botany-aware, with a regularization loss that preserves the original DOFA embedding space.

Weaknesses

1. I believe the paper has very limited technical novelty, given that several works already align satellite image representations with wildlife observations and descriptions. The authors should compare their work with WildSat, TaxaBind and EcoWikiRS and clearly discuss how their work is technically different from these works.
2. In terms of the method itself, the paper uses existing architectures and losses. The authors could have explored some domain-specific design choices such as hand

Taxonomy

Topics: Species Distribution and Climate Change · Domain Adaptation and Few-Shot Learning · Animal Vocal Communication and Behavior