TL;DR
This paper investigates how data owners signal consent to, or restrictions on, the use of their data in web-scraped vision-language datasets. It finds that current collection practices often ignore these signals, raising ethical and legal concerns.
Contribution
It analyzes consent signals in DataComp at both the sample level and the web-domain level, highlights where data owners' wishes go unrespected, and argues for a unified consent framework.
Findings
At least 122 million samples show copyright notices.
60% of samples from the top domains come from sites whose ToS prohibit scraping.
Watermark detection methods fail to identify 9-13% of watermarked samples.
Abstract
The internet has become the main source of data to train modern text-to-image or vision-language models, yet it is increasingly unclear whether web-scale data collection practices for training AI systems adequately respect data owners' wishes. Ignoring the owner's indication of consent around data usage not only raises ethical concerns but also has recently been elevated into lawsuits around copyright infringement cases. In this work, we aim to reveal information about data owners' consent to AI scraping and training, and study how it's expressed in DataComp, a popular dataset of 12.8 billion text-image pairs. We examine both the sample-level information, including the copyright notice, watermarking, and metadata, and the web-domain-level information, such as a site's Terms of Service (ToS) and Robots Exclusion Protocol. We estimate at least 122M of samples exhibit some indication of…