MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

TL;DR
MONET is a large, open, and richly annotated text-to-image dataset of approximately 104.9 million image-text pairs, designed to support reproducible research in large-scale text-to-image modeling.
Contribution
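The paper introduces MONET, an Apache 2.0 dataset distilled from 2.9B raw image-text pairs through safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models, then augmented with synthetically generated samples.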
Findings
A 4B-parameter latent diffusion model trained exclusively on MONET yields competitive GenEval and DPG scores.
MONET's diverse and high-quality data supports large-scale model training.
The dataset accelerates research by providing pre-computed embeddings and annotations.
Abstract
Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image--text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and…
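The exact and near-duplicate removal stage named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the record fields (`image_bytes`, `embedding`), the SHA-256 content hash, the cosine-similarity criterion, and the 0.95 threshold are all assumptions chosen for illustration, and the quadratic comparison loop would need to be replaced by an approximate nearest-neighbour index at the scale described.

```python
import hashlib
import math


def exact_dedup(records):
    """Keep the first record for each distinct image byte content."""
    seen = set()
    kept = []
    for rec in records:
        digest = hashlib.sha256(rec["image_bytes"]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_dedup(records, threshold=0.95):
    """Greedily drop records whose embedding is too close to a kept one.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    kept = []
    for rec in records:
        if all(cosine(rec["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(rec)
    return kept


def deduplicate(records, threshold=0.95):
    """Exact removal first (cheap), then near-duplicate removal."""
    return near_dedup(exact_dedup(records), threshold)


# Toy data: "b" is a byte-identical copy of "a", "c" is a near copy of "a".
sample = [
    {"id": "a", "image_bytes": b"\x00\x01", "embedding": [1.0, 0.0]},
    {"id": "b", "image_bytes": b"\x00\x01", "embedding": [1.0, 0.0]},
    {"id": "c", "image_bytes": b"\x00\x02", "embedding": [0.99, 0.05]},
    {"id": "d", "image_bytes": b"\x07\x08", "embedding": [0.0, 1.0]},
]
result_ids = [r["id"] for r in deduplicate(sample)]
print(result_ids)  # ['a', 'd']
```

Running the exact pass before the similarity pass keeps the expensive pairwise comparisons off records that a hash lookup can already discard.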
