Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms

Jordan Meyer, Nick Padgett, Cullen Miller, and Laura Exline

arXiv:2410.23144·cs.AI·October 31, 2024

TL;DR

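Public Domain 12M (PD12M) is a dataset of 12.4 million public domain and CC0-licensed images with synthetic captions, the largest public domain image-text dataset to date. It is paired with community-driven governance mechanisms, delivered through the Source.Plus platform, that aim to reduce harm and support reproducible training of text-to-image models.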

Contribution

The paper introduces PD12M, the largest public domain image-text dataset, along with novel community-driven governance mechanisms for dataset management.

Findings

01. Largest public domain image-text dataset to date

02. Community governance mechanisms reduce harm and enhance reproducibility

03. Supports training of foundation models with minimal copyright concerns

Abstract

We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.

Taxonomy

Topics: Image Retrieval and Classification Techniques