Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
Zizhao Hu, Mohammad Rostami, Jesse Thomason

TL;DR
This paper investigates how model collapse occurs in multi-modal generative systems like VLMs and diffusion models, revealing unique behaviors and proposing mitigation strategies to improve stability in self-improving AI systems.
Contribution
It extends the study of model collapse to multi-modal systems, providing new insights and practical guidelines for maintaining stability in multi-agent synthetic data training.
Findings
Model collapse exhibits distinct characteristics in multi-modal systems.
Increased decoding budgets and model diversity can mitigate collapse.
During collapse, multi-modal systems show improved vision-language alignment and increased variance.
Abstract
Recent research has highlighted the risk of generative model collapse, where performance progressively degrades when a model is continually trained on self-generated data. However, existing work on model collapse is limited to single, unimodal models, which constrains our understanding of more realistic scenarios, such as diverse multi-modal AI agents that interact autonomously through synthetic data and evolve continually. We expand the study of synthetic data training and model collapse to multi-modal vision-language generative systems, such as vision-language models (VLMs) and text-to-image diffusion models, as well as to recursive generate-train loops with multiple models. We find that model collapse, previously observed in single-modality generative models, exhibits distinct characteristics in the multi-modal context, such as improved vision-language alignment and increased variance in VLM…
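The recursive generate-train loop mentioned in the abstract can be illustrated with a deliberately simple toy. The sketch below is not the authors' setup: it assumes a one-dimensional Gaussian as the "model", and the function name `recursive_generate_train` and its parameters are hypothetical. Each generation samples synthetic data from the current model and refits the model on those samples alone. In this unimodal toy the fitted spread tends to shrink over many generations, which is the classic single-model collapse picture; the paper reports that multi-modal systems behave differently.

```python
import random
import statistics


def recursive_generate_train(generations=50, samples_per_generation=20, seed=0):
    """Toy generate-train loop (an assumption for illustration, not the paper's method).

    The "model" is a Gaussian described by (mean, std). Each generation it
    generates synthetic samples and is then refit on those samples only.
    Returns the fitted std after each generation, starting with generation 0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for a model fit to real data
    std_history = [std]
    for _ in range(generations):
        # "Generate": draw synthetic data from the current model.
        synthetic = [rng.gauss(mean, std) for _ in range(samples_per_generation)]
        # "Train": refit the model using only its own synthetic data.
        mean = statistics.fmean(synthetic)
        std = statistics.pstdev(synthetic)
        std_history.append(std)
    return std_history


if __name__ == "__main__":
    history = recursive_generate_train()
    for generation in (0, 10, 25, 50):
        print(f"generation {generation:2d}: std = {history[generation]:.4f}")
```

Mixing a share of real data back in at each generation, or fitting several distinct models, are natural variations to try in this toy; whether they help here says nothing definitive about the VLM and diffusion results summarized above.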
Taxonomy
Topics: Neural Networks and Applications · Natural Language Processing Techniques
