InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language
Nicklas Neu, Thomas Ebner, Jasmin Primus, Raphael Zefferer, Bernhard Schenkenfelder, Mathias Brunbauer, Florian Kromp

TL;DR
This paper introduces InVitroVision, a multi-modal vision-language model obtained by fine-tuning PaliGemma-2 on only 1,000 captioned embryo images to generate natural language descriptions of embryo development. It outperforms ChatGPT 5.2 and the base models in overall metrics.
Contribution
It demonstrates the effectiveness of fine-tuning vision-language models on small datasets for IVF embryo description tasks.
Findings
InVitroVision outperforms ChatGPT 5.2 and the base models in overall metrics for embryo description.
Performance improves with larger training datasets.
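Fine-tuning with only 1,000 images and corresponding captions is enough to produce natural language descriptions of embryo morphology, embryonic cell cycle and developmental stage.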
Abstract
The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language…
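To make the data setup in the abstract concrete, the following is a minimal sketch of how a fixed-size set of image-caption pairs might be assembled from annotated time-lapse frames before fine-tuning. It is not the authors' pipeline: the `EmbryoFrame` fields, the caption template, the prompt string, and the function names are all hypothetical, and only the sample size of 1,000 comes from the abstract.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbryoFrame:
    """One annotated time-lapse frame (all fields are hypothetical)."""
    image_path: str            # path to the frame image
    stage: str                 # developmental-stage label, e.g. "cleavage"
    cell_count: Optional[int]  # number of visible cells, if annotated


def make_caption(frame: EmbryoFrame) -> str:
    """Turn structured labels into a short caption (assumed template)."""
    caption = f"Embryo at the {frame.stage} stage"
    if frame.cell_count is not None:
        noun = "cell" if frame.cell_count == 1 else "cells"
        caption += f" with {frame.cell_count} {noun} visible"
    return caption + "."


def build_training_pairs(frames, n_samples=1000, seed=0):
    """Draw a reproducible random subset and pair each image with a caption."""
    rng = random.Random(seed)
    chosen = rng.sample(frames, min(n_samples, len(frames)))
    return [
        {
            "image": f.image_path,
            "prompt": "describe the embryo",  # assumed prompt text
            "caption": make_caption(f),
        }
        for f in chosen
    ]
```

Each resulting record could then be passed to whatever fine-tuning loop is used for the vision-language model. Fixing the seed keeps the subset reproducible, which matters when comparing runs with different training-set sizes.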
