Data Portraits: Recording Foundation Model Training Data

Marc Marone; Benjamin Van Durme

arXiv:2303.03919·cs.LG·December 15, 2023·5 cites

TL;DR

This paper introduces Data Portraits: lightweight artifacts that record the training data of foundation models and allow downstream inspection, improving transparency and enabling detection of test set leakage and model plagiarism.

Contribution

The paper proposes and implements a data-sketching approach for building Data Portraits, stressing fast, space-efficient querying of foundation model training datasets.

Findings

01. Enabled detection of test set leakage
02. Allowed identification of model plagiarism
03. Cost only 3% of dataset size in overhead

Abstract

Foundation models are trained on increasingly immense and opaque datasets. Even while these models are now key in AI system building, it can be difficult to answer the straightforward question: has the model already encountered a given example during training? We therefore propose a widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. First we outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, stressing fast and space efficient querying. Using our tools, we document a popular language modeling corpus (The Pile) and a recently released code modeling dataset (The Stack). We show that our solution enables answering questions about test set leakage and model plagiarism. Our tool is lightweight and fast,…
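The membership question in the abstract ("has the model already encountered a given example during training?") can be illustrated with a small sketch. The abstract only says the solution is "based on data sketching"; the Bloom filter used here, the character n-gram width, and all names (`BloomFilter`, `build_portrait`, `overlap_fraction`) are assumptions chosen for illustration, not the authors' implementation.

```python
import hashlib
import math


class BloomFilter:
    """Bit-array sketch: may report false positives, never false negatives."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard Bloom filter sizing for a target false-positive rate.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def char_ngrams(text: str, width: int) -> list[str]:
    """All overlapping character n-grams of the given width."""
    return [text[i:i + width] for i in range(len(text) - width + 1)]


def build_portrait(documents, width: int = 20, capacity: int = 10_000) -> BloomFilter:
    """Record every n-gram of every document in the sketch (hypothetical helper)."""
    sketch = BloomFilter(capacity)
    for doc in documents:
        for gram in char_ngrams(doc, width):
            sketch.add(gram)
    return sketch


def overlap_fraction(sketch: BloomFilter, query: str, width: int = 20) -> float:
    """Fraction of the query's n-grams that the sketch reports as seen."""
    grams = char_ngrams(query, width)
    if not grams:
        return 0.0
    return sum(gram in sketch for gram in grams) / len(grams)


corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
portrait = build_portrait(corpus)
seen = overlap_fraction(portrait, "brown fox jumps over the lazy dog")
unseen = overlap_fraction(portrait, "completely different sentence about training data")
```

In this toy setup `seen` is 1.0, because every n-gram of a verbatim substring was inserted and a Bloom filter has no false negatives. `unseen` should be near zero, although false positives are possible by design. A high overlap on a benchmark example would hint at test set leakage; a high overlap on model output would hint at reproduced training text.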

Videos

The biggest week in AI (GPT-4, Office Copilot, Google PaLM, Anthropic Claude & more) · youtube

Data Portraits: Recording Foundation Model Training Data · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research