Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models

Zizhao Hu; Mohammad Rostami; Jesse Thomason

arXiv:2505.08803 · cs.LG · May 15, 2025

Open Access

TL;DR

This paper investigates how model collapse occurs in multi-modal generative systems like VLMs and diffusion models, revealing unique behaviors and proposing mitigation strategies to improve stability in self-improving AI systems.

Contribution

It extends the study of model collapse to multi-modal systems, providing new insights and practical guidelines for maintaining stability in multi-agent synthetic data training.

Findings

01 Model collapse exhibits distinct characteristics in multi-modal systems.

02 Increased decoding budgets and model diversity can mitigate collapse.

03 During collapse, multi-modal systems show improved vision-language alignment and increased variance.

Abstract

Recent research has highlighted the risk of generative model collapse, where a model's performance progressively degrades when it is continually trained on self-generated data. However, existing work on model collapse is restricted to single, unimodal models, which limits our understanding of more realistic scenarios, such as diverse multi-modal AI agents that interact autonomously through synthetic data and continually evolve. We extend the study of synthetic data training and model collapse to multi-modal vision-language generative systems, such as vision-language models (VLMs) and text-to-image diffusion models, as well as to recursive generate-train loops with multiple models. We find that model collapse, previously observed in single-modality generative models, exhibits distinct characteristics in the multi-modal context, such as improved vision-language alignment and increased variance in VLM…
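To make the idea of a recursive generate-train loop concrete, here is a minimal toy sketch in Python. It is not the paper's setup: it replaces the VLM or diffusion model with a one-dimensional Gaussian, and the function name `recursive_fit` and its parameters are hypothetical, chosen only for illustration. Each generation "trains" by fitting a mean and standard deviation to samples drawn from the previous generation's fit, with no fresh real data.

```python
import random
import statistics


def recursive_fit(generations=200, n_samples=20, seed=0):
    """Toy generate-train loop (illustrative assumption, not the paper's method).

    Generation 0 is the "real" distribution N(0, 1). Every later generation
    is fitted only to samples produced by the generation before it.
    Returns a list of (mean, std) pairs, one per generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [(mu, sigma)]
    for _ in range(generations):
        # "Generate": sample synthetic data from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # "Train": refit the model on its own synthetic data only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    history = recursive_fit()
    for generation in (0, 50, 100, 200):
        mu, sigma = history[generation]
        print(f"generation {generation}: mean={mu:.4f}, std={sigma:.4f}")
```

In this single-model, unimodal toy the fitted standard deviation typically drifts toward zero over many generations, which is the kind of degradation earlier collapse studies describe. The abstract reports that multi-modal systems behave differently in some respects, so the sketch illustrates only the loop structure, not the paper's findings.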

Taxonomy

Topics: Neural Networks and Applications · Natural Language Processing Techniques