PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles

Leonardo Brusini; Cristian Sbrolli; Eugenio Lomurno; Toshihiko Yamasaki; Matteo Matteucci

arXiv:2602.01370·cs.CV·February 3, 2026

TL;DR

PolyGen is a multi-generator ensemble framework for synthetic vision-language data that prioritizes diversity and compositionality over dataset size. It outperforms the leading single-source baseline by +19.0% on multi-task benchmarks and gains +9.1% on the SugarCrepe++ compositionality benchmark.

Contribution

PolyGen proposes a multi-generator ensemble approach, combined with a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding, to improve feature diversity and data efficiency in synthetic data generation.

Findings

01 Outperforms the single-source baseline by +19.0% on multi-task benchmarks.

02 Achieves +9.1% on the SugarCrepe++ compositionality benchmark.

03 Shows that structural diversity contributes more than simply increasing data volume.

Abstract

Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach to train on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source…
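The abstract does not say how the hard negatives are built or how the budget is split, so the following is only a minimal sketch under stated assumptions. It assumes a hard negative can be produced by swapping two words in a caption so that the vocabulary stays the same while the meaning changes, and that "reallocating the same data budget" means rendering each caption once per generator. The function names `swap_hard_negative` and `split_budget`, the example caption, and the numbers are hypothetical and are not taken from the paper.

```python
import re
from typing import Optional


def swap_hard_negative(caption: str, word_a: str, word_b: str) -> Optional[str]:
    """Exchange word_a and word_b in the caption (hypothetical helper).

    Returns None when the swap cannot produce a different caption,
    i.e. when the words are identical or either one is absent.
    """
    if word_a == word_b:
        return None
    pattern = re.compile(rf"\b({re.escape(word_a)}|{re.escape(word_b)})\b")
    # Both words must occur, otherwise the result would not be a true swap.
    if set(pattern.findall(caption)) != {word_a, word_b}:
        return None
    return pattern.sub(
        lambda m: word_b if m.group(1) == word_a else word_a, caption
    )


def split_budget(total_images: int, num_generators: int) -> int:
    """Unique captions available when each caption is rendered once per generator.

    Assumption: the image budget is fixed and shared evenly across generators.
    """
    return total_images // num_generators


caption = "a red cube on top of a blue sphere"
print(swap_hard_negative(caption, "red", "blue"))
# -> a blue cube on top of a red sphere

# Illustrative numbers only: 1,000,000 images spread over 4 generators.
print(split_budget(1_000_000, 4))
# -> 250000
```

The swapped caption reuses every word of the original, so a model can only tell the two apart by attending to which attribute binds to which object. The budget helper makes the trade-off explicit: under these assumptions, more generators mean fewer unique captions for the same number of images.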

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis