BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models

Ziheng Zhang; Xinyue Ma; Arpita Chowdhury; Elizabeth G. Campolongo; Matthew J. Thompson; Net Zhang; Samuel Stevens; Hilmar Lapp; Tanya Berger-Wolf; Yu Su; Wei-Lun Chao; Jianyang Gu

arXiv:2510.20095·cs.CV·March 3, 2026

TL;DR

This paper introduces BioCAP, a biological foundation model trained with synthetic captions generated by multimodal large language models, which enhances species classification and retrieval by leveraging descriptive captions beyond traditional labels.

Contribution

The work presents a novel approach for generating accurate, instance-specific synthetic captions for biological images, improving multimodal foundation models in organismal biology.

Findings

01 BioCAP outperforms label-only models in species classification.

02 Synthetic captions improve text-image retrieval accuracy.

03 Guided caption generation reduces hallucination in descriptions.

Abstract

This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific…
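The alignment between images and captions described above can be illustrated with a CLIP-style symmetric contrastive loss. This is a minimal sketch and not the authors' implementation: the excerpt does not state the exact objective, so the loss form, the function name `symmetric_contrastive_loss`, and the temperature value of 0.07 are all assumptions made for illustration, and random vectors stand in for real image and caption embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss (assumed form): row i of image_emb is paired with row i of text_emb."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick its own caption among all captions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should pick its own image among all images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Hypothetical embeddings: 4 image vectors of dimension 8.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))

# Perfectly aligned captions (identical vectors) versus mismatched pairings.
aligned = symmetric_contrastive_loss(image_emb, image_emb)
shuffled = symmetric_contrastive_loss(image_emb, image_emb[::-1])
print(f"aligned loss: {aligned:.4f}, shuffled loss: {shuffled:.4f}")
```

When the caption embedding of each sample matches its image embedding, the loss is low; pairing each image with another sample's caption raises it. Minimizing such a loss is one way a model could be pushed toward the shared structure the abstract describes.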

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The paper proposes interesting tweaks like using images and captions as complementary views of a species’ latent trait vector and training with contrastive learning to emphasize diagnostic features. The dual projector cleanly separates heterogeneous supervision as compared to a single projector. The paper will be of reasonable interest for the broader community interested in training VLMs for life sciences.

Weaknesses

Captions are biased toward the chosen InternVL3-38B, and there is no cross-MLLM comparison. Behavior labels for analysis are auto-assigned by GPT-4o, which may cause label drift. Large deltas are shown in the experimental results, but statistical significance and confidence intervals are not reported for the benchmarks.

Reviewer 02 · Rating 6 · Confidence 5

Strengths

- Multimodal alignment in biology is an under-explored and important task.
- The paper proposes an interesting idea: images and captions are treated as complementary projections of a species’ latent morphospace, so aligning them helps capture diagnostic traits while suppressing noise.
- The paper introduces a dual-projector architecture that elegantly separates taxonomy vs. caption supervision, and its use of context-guided caption generation (Wikipedia + format examples) effectively mitigates LLM halluc

Weaknesses

- The captions' reliance on Wikipedia-derived descriptors could reinforce taxonomic bias and exclude rare or poorly documented species. How does the framework handle species without any Wikipedia entry or with minimal trait descriptions?
- The repeated LLM re-generation could produce inconsistent style or attribute focus across species, causing semantic drift.
- Potential scalability issue, since the caption generation and derail-view training are computationally costly.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Originality: the authors use an interesting approach to caption generation, prompting MLLMs with a format specification on a per-class basis. The idea was to constrain the output to focus on salient morphological description that can be challenging to pick out of raw text without guidance. The approach to leveraging the captions in BioCLIP is appealingly simple.
- Quality: The overall quality is quite good, with lots of experiments executed with sufficient data. The ablation studies do a nice job of ill

Weaknesses

The format example design discussion needs expansion. This seems to be a critical element of the work and it is treated very narrowly. It isn't clear what the 'classes' are that were used to query Gemini Deep Research, who did the winnowing of the results, or how consistent the examples were across the classes. This element may itself benefit from an exploration of how variable those exemplars are between model runs and how consistent the human overseers were in selecting appropriate descriptio

Code & Models

Models

imageomics/biocap
model· 106 dl

Datasets

imageomics/TreeOfLife-10M-Captions
dataset· 128 dl

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Multimodal Machine Learning Applications · Cell Image Analysis Techniques