Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Shoaib Ahmed Siddiqui; Nitarshan Rajkumar; Tegan Maharaj; David Krueger; Sara Hooker

arXiv:2209.10015 · cs.LG · September 22, 2022 · 6 cites

TL;DR

This paper introduces Metadata Archaeology, a unified framework that uses training dynamics to identify and curate subsets of a dataset with specific properties, such as mislabeled, atypical, or out-of-distribution examples.

Contribution

It proposes an efficient method for inferring per-example metadata from learning dynamics, without requiring a priori knowledge or existing metadata such as domain labels.

Findings

01. Effective in identifying mislabeled data
02. Classifies minority-group samples accurately
03. Enables scalable human data auditing

Abstract

Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in "untidy" or raw data, practitioners are faced with significant issues of data quality and diversity which can be prohibitively labor intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology, that is, uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to…
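The idea of comparing an example's learning dynamics against curated probe suites can be illustrated with a small sketch. This is not the authors' implementation: it assumes each example is summarized by its per-epoch training loss, and it assigns metadata by a nearest-neighbour vote over probe trajectories. The function names (`euclidean`, `infer_metadata`), the probe labels, and all loss values are hypothetical and made up for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length loss trajectories."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def infer_metadata(trajectory, probes, k=3):
    """Label one example by majority vote over its k nearest probe trajectories.

    trajectory: list of per-epoch losses for the example being audited.
    probes: list of (loss_trajectory, probe_label) pairs from curated probe suites.
    """
    nearest = sorted(probes, key=lambda p: euclidean(trajectory, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy probe suites (invented numbers): "typical" examples are learned quickly,
# while examples with deliberately corrupted labels keep a high loss for longer.
probes = [
    ([2.3, 1.0, 0.3, 0.1], "typical"),
    ([2.2, 0.9, 0.2, 0.1], "typical"),
    ([2.4, 1.1, 0.4, 0.1], "typical"),
    ([2.5, 2.4, 2.2, 1.9], "mislabeled"),
    ([2.6, 2.5, 2.3, 2.0], "mislabeled"),
    ([2.4, 2.3, 2.1, 1.8], "mislabeled"),
]

fast = infer_metadata([2.3, 1.0, 0.25, 0.1], probes)
slow = infer_metadata([2.5, 2.4, 2.1, 1.9], probes)
print(fast, slow)
```

In practice the trajectories would be logged during an ordinary training run, and the probe suites would be built by applying a transformation (for example, replacing a label) to a small held-out set of examples; both of those steps are left out of this sketch.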

Code & Models

Repositories

shoaibahmed/metadata_archaeology
pytorch

Videos

Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics · slideslive

Taxonomy

Topics: Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI) · Image Processing and 3D Reconstruction