BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu

TL;DR
This paper introduces BioCAP, a biological foundation model trained with synthetic captions generated by multimodal large language models, which enhances species classification and retrieval by leveraging descriptive captions beyond traditional labels.
Contribution
The work presents a novel approach to generate accurate, instance-specific synthetic captions for biological images, improving multimodal foundation models in organismal biology.
Findings
BioCAP outperforms label-only models in species classification.
Synthetic captions improve text-image retrieval accuracy.
Guided caption generation reduces hallucination in descriptions.
Abstract
This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific…
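The guided generation step mentioned in the abstract (Wikipedia-derived visual information plus taxon-tailored format examples) can be pictured as a prompt-assembly routine. The sketch below is only an illustration under stated assumptions: the function name `build_caption_prompt`, its parameters, and the prompt wording are hypothetical and are not taken from the paper, and the call to the MLLM itself is left out.

```python
def build_caption_prompt(taxon_name, visual_notes, format_examples):
    """Assemble a text prompt for an MLLM from per-taxon context.

    All names and wording here are hypothetical; the paper's actual
    prompt template is not reproduced in this summary.
    """
    lines = [
        f"Describe the organism in the attached image (labeled as {taxon_name}).",
        "Mention only traits that are visible in this specific image.",
    ]
    # Optional grounding: visual facts about the taxon (e.g. from an encyclopedia).
    if visual_notes:
        lines.append("Reference visual information for this taxon:")
        lines.extend(f"- {note}" for note in visual_notes)
    # Optional style guidance: example descriptions for the taxon's group.
    if format_examples:
        lines.append("Follow the style of these example descriptions:")
        lines.extend(f"- {example}" for example in format_examples)
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder inputs, for illustration only.
    print(build_caption_prompt(
        "Cardinalis cardinalis",
        ["Adult males are red with a pointed crest."],
        ["A medium-sized songbird with a short, thick bill and a raised crest."],
    ))
```

Keeping the grounding notes and the format examples as separate, optional inputs means a taxon with no reference text still gets a usable prompt, which is the case one reviewer asks about.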
Peer Reviews
Decision: ICLR 2026 Poster
The paper proposes interesting tweaks, such as treating images and captions as complementary views of a species’ latent trait vector and training with contrastive learning to emphasize diagnostic features. Compared with a single projector, the dual projector cleanly separates heterogeneous supervision. The paper will be of reasonable interest to the broader community working on training VLMs for the life sciences.
Captions are biased toward the chosen InternVL3-38B, and there is no cross-MLLM comparison. Behavior labels for analysis are auto-assigned by GPT-4o, which may cause label drift. Large deltas are shown in the experimental results, but statistical significance and confidence intervals are not reported for the benchmarks.
- Multimodal alignment in biology is an under-explored and important task.
- The paper proposes an interesting idea: images and captions are treated as complementary projections of a species’ latent morphospace, so aligning them helps capture diagnostic traits while suppressing noise.
- The paper introduces a dual-projector architecture that elegantly separates taxonomy and caption supervision, and its context-guided caption generation (Wikipedia + format examples) effectively mitigates LLM halluc
- The captions' reliance on Wikipedia-derived descriptors could reinforce taxonomic bias and exclude rare or poorly documented species. How does the framework handle species with no Wikipedia entry or only minimal trait descriptions?
- Repeated LLM re-generation could produce inconsistent style or attribute focus across species, causing semantic drift.
- Potential scalability issue, since caption generation and derail-view training are computationally costly.
- Originality: The authors use an interesting approach to caption generation, prompting MLLMs with a format specification on a per-class basis. The idea is to constrain the output to salient morphological description, which can be challenging to pick out of raw text without guidance. The approach to leveraging the captions in BioCLIP is appealingly simple.
- Quality: The overall quality is quite good, with many experiments executed with sufficient data. The ablation studies do a nice job of ill
The format example design discussion needs expansion. This seems to be a critical element of the work, yet it is treated very narrowly. It isn't clear what the 'classes' used to query Gemini Deep Research are, who winnowed the results, or how consistent the examples were across classes. This element may itself benefit from an exploration of how variable those exemplars are between model runs and how consistent the human overseers were in selecting appropriate descriptio
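The dual-projector idea the reviewers describe (one projection head for taxonomy supervision, another for caption supervision, each trained contrastively) can be sketched in a few lines. This is a minimal pure-Python illustration under several assumptions that are not confirmed by this page: the two projectors are plain linear maps applied to the image features, text embeddings arrive already unit-normalized, the loss is a symmetric InfoNCE, and the two terms are summed with equal weight. All function names are hypothetical.

```python
import math


def project(x, weight):
    """Apply a linear projection (list of rows) and L2-normalize the result."""
    y = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    norm = math.sqrt(sum(v * v for v in y)) or 1.0  # avoid division by zero
    return [v / norm for v in y]


def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE; row i of img is the positive pair of row i of txt."""
    n = len(img)
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], using a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            total += m + math.log(sum(math.exp(v - m) for v in row)) - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def dual_projector_loss(feats, label_txt, caption_txt, w_tax, w_cap):
    """Sum of two contrastive terms, each using its own image projector (assumed)."""
    tax_view = [project(f, w_tax) for f in feats]
    cap_view = [project(f, w_cap) for f in feats]
    return info_nce(tax_view, label_txt) + info_nce(cap_view, caption_txt)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    feats = [[1.0, 0.0], [0.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    print(dual_projector_loss(feats, text, text, identity, identity))
```

Giving each supervision source its own head means a gradient from noisy captions does not have to pass through the same projection that encodes the taxonomic labels, which is one plausible reading of the "cleanly separates heterogeneous supervision" remark.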
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Multimodal Machine Learning Applications · Cell Image Analysis Techniques
