Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

Xieji Li, Siyuan Yan, Yingsheng Liu, H. Peter Soyer, Monika Janda, Victoria Mar, and Zongyuan Ge

arXiv:2512.03445 · cs.CV · December 4, 2025

TL;DR

This paper introduces a vision-language pretraining framework for medical images that improves data quality through multi-agent data generation and handles long, unstructured text through ontology-based knowledge decomposition, achieving state-of-the-art zero-shot disease classification and improved cross-modal retrieval in dermatology experiments.

Contribution

The paper proposes a multi-agent data generation system (MAGEN) and ontology-based multi-aspect knowledge-enhanced pretraining (O-MAKE) for medical vision-language pretraining, addressing noisy web-collected data and long, unstructured medical text.

Findings

01. Achieves state-of-the-art zero-shot disease classification.

02. Improves cross-modal retrieval performance.

03. Validates effectiveness through comprehensive dermatology experiments.

Abstract

Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both…
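To make the idea of aligning an image with several decomposed knowledge aspects more concrete, here is a minimal sketch in plain Python. It is not the paper's O-MAKE objective, which the truncated abstract does not specify. It assumes each caption has already been split into per-aspect texts and embedded, and it swaps in the standard symmetric InfoNCE (CLIP-style) contrastive loss, averaged over aspects. The function names, the temperature value, and the aspect names "morphology" and "location" are all hypothetical.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.07):
    # Mean cross-entropy in which queries[i] should match keys[i]
    # and every other key in the batch acts as a negative.
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def multi_aspect_loss(image_embs, aspect_text_embs, temperature=0.07):
    # image_embs: list of B image vectors.
    # aspect_text_embs: dict mapping a (hypothetical) aspect name to a
    # list of B text vectors, one per image.
    imgs = [normalize(v) for v in image_embs]
    losses = []
    for texts in aspect_text_embs.values():
        txts = [normalize(v) for v in texts]
        i2t = info_nce(imgs, txts, temperature)  # image -> text direction
        t2i = info_nce(txts, imgs, temperature)  # text -> image direction
        losses.append(0.5 * (i2t + t2i))
    # Average the symmetric loss across all aspects.
    return sum(losses) / len(losses)

# Toy batch of two images with two hypothetical aspects each.
images = [[1.0, 0.0], [0.0, 1.0]]
aspects = {
    "morphology": [[0.9, 0.1], [0.1, 0.9]],
    "location": [[0.8, 0.2], [0.2, 0.8]],
}
aligned = multi_aspect_loss(images, aspects)
mismatched = multi_aspect_loss(images, {k: v[::-1] for k, v in aspects.items()})
```

In this toy batch the correctly paired aspect texts give a much lower loss than the reversed pairing, which is the behaviour a contrastive alignment objective is meant to reward.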

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare