Data Checklist: On Unit-Testing Datasets with Usable Information

Heidi C. Zhang; Shabnam Behzad; Kawin Ethayarajh; Dan Jurafsky

arXiv:2408.02919·cs.CL·August 7, 2024

TL;DR

This paper introduces a principled, taxonomy-based approach to unit-testing datasets for language models, enabling detection of known and unknown artifacts and improving data efficiency in model alignment.

Contribution

It proposes a taxonomy of dataset unit tests grounded in the V-information literature. A collection of such tests, called a data checklist, systematically identifies annotation artifacts and enables data filtering that improves model alignment.

Findings

01 Recovered known artifacts in the SNLI dataset

02 Discovered new artifacts in LLM preference datasets

03 Improved data filtering enhances alignment efficacy

Abstract

Model checklists (Ribeiro et al., 2020) have emerged as a useful tool for understanding the behavior of LLMs, analogous to unit-testing in software engineering. However, despite datasets being a key determinant of model behavior, evaluating datasets, e.g., for the existence of annotation artifacts, is largely done ad hoc, once a problem in model behavior has already been found downstream. In this work, we take a more principled approach to unit-testing datasets by proposing a taxonomy based on the V-information literature. We call a collection of such unit tests a data checklist. Using a checklist, not only are we able to recover known artifacts in well-known datasets such as SNLI, but we also discover previously unknown artifacts in preference datasets for LLM alignment. Data checklists further enable a new kind of data filtering, which we use to improve the efficacy and data…
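To make the idea of a V-information-based unit test concrete, here is a minimal Python sketch. It assumes the standard definitions from the V-information literature: pointwise V-information is the gain in log-probability of the gold label when a model sees the input rather than a null input, and V-information is the average of that gain over the dataset. The function names, the `threshold_bits` parameter, and the toy log-probabilities are hypothetical illustrations, not the paper's taxonomy or the API of the authors' repository; in practice the log-probabilities would come from two fine-tuned models.

```python
import math

def v_information(logp_null, logp_cond):
    """Estimate V-usable information (in bits) from per-example log-probs.

    logp_null[i]: natural-log probability of the gold label for example i
                  under a model trained on a null (empty) input.
    logp_cond[i]: the same quantity under a model trained on the real input.
    Returns (dataset-level estimate, list of pointwise values).
    """
    if not logp_null or len(logp_null) != len(logp_cond):
        raise ValueError("need two non-empty lists of equal length")
    # Pointwise V-information: log2 p(y | x) - log2 p(y | null input).
    pvi = [(c - n) / math.log(2) for n, c in zip(logp_null, logp_cond)]
    return sum(pvi) / len(pvi), pvi

def passes_artifact_test(logp_null, logp_partial, threshold_bits=0.1):
    """Hypothetical unit test: a partial input (e.g. the hypothesis alone
    in an NLI pair) should carry at most `threshold_bits` of usable
    information about the label. Returns True when the test passes."""
    v_info, _ = v_information(logp_null, logp_partial)
    return v_info <= threshold_bits

# Toy numbers (assumed, not measured): a 3-class task where the null model
# is uniform and a hypothesis-only model does better than chance.
null_lp = [math.log(1 / 3)] * 4
hyp_only_lp = [math.log(p) for p in (0.6, 0.7, 0.5, 0.4)]

v_info, _ = v_information(null_lp, hyp_only_lp)
print(f"hypothesis-only V-information: {v_info:.3f} bits")
print("artifact test passed:", passes_artifact_test(null_lp, hyp_only_lp))
```

A failing test of this kind flags that a supposedly uninformative part of the input predicts the label, which is the sort of annotation artifact the abstract describes; the pointwise values could likewise be used to rank examples for filtering.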

Code & Models

Repositories

ChenyuHeidiZhang/data_checklist (PyTorch, official)

Taxonomy

Topics: Data Quality and Management · Software System Performance and Reliability