WILDS: A Benchmark of in-the-Wild Distribution Shifts

Pang Wei Koh; Shiori Sagawa; Henrik Marklund; Sang Michael Xie; Marvin Zhang; Akshay Balsubramani; Weihua Hu; Michihiro Yasunaga; Richard Lanas Phillips; Irena Gao; Tony Lee; Etienne David; Ian Stavness; Wei Guo; Berton A. Earnshaw; Imran S. Haque; Sara Beery; Jure Leskovec; Anshul Kundaje; Emma Pierson; Sergey Levine; Chelsea Finn; Percy Liang

arXiv:2012.07421 · cs.LG · July 19, 2021 · 286 cites

Open Access · 5 Repos · 1 Dataset · 1 Video

TL;DR

WILDS is a curated benchmark of 10 real-world datasets covering a diverse range of distribution shifts. On each dataset, standard training yields substantially lower out-of-distribution than in-distribution performance.

Contribution

The paper presents WILDS, a new benchmark of datasets that reflect real-world distribution shifts, and provides tools for standardized evaluation to foster the development of robust models.

Findings

01. Models trained with standard methods perform substantially worse out-of-distribution than in-distribution on each dataset.

02. Existing methods for tackling distribution shifts do not close this performance gap.

03. WILDS facilitates research on robustness to distribution shifts.

Abstract

Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution than in-distribution performance. This gap remains even with models trained by existing methods for tackling…
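
To make the in-distribution versus out-of-distribution gap described in the abstract concrete, here is a minimal sketch on synthetic data. It does not use the WILDS datasets or package; the one-dimensional feature, the `offset` parameter standing in for a domain such as a hospital, and the midpoint-threshold classifier are all illustrative assumptions.

```python
import random


def sample_domain(rng, n, offset):
    """Draw n (feature, label) pairs from a toy domain.

    `offset` is a hypothetical domain effect that shifts every feature,
    loosely mimicking a change of hospital or camera.
    """
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = 2.0 * y + offset + rng.gauss(0.0, 0.5)
        data.append((x, y))
    return data


def fit_threshold(train):
    """Return the midpoint between the two class means.

    A stand-in for standard training on the training domain only.
    """
    means = []
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means.append(sum(xs) / len(xs))
    return (means[0] + means[1]) / 2.0


def accuracy(data, threshold):
    """Fraction of points whose thresholded feature matches the label."""
    correct = sum(1 for x, y in data if int(x > threshold) == y)
    return correct / len(data)


rng = random.Random(0)
train = sample_domain(rng, 2000, offset=0.0)     # training domain
id_test = sample_domain(rng, 2000, offset=0.0)   # same domain, held out
ood_test = sample_domain(rng, 2000, offset=1.5)  # unseen, shifted domain

threshold = fit_threshold(train)
id_acc = accuracy(id_test, threshold)
ood_acc = accuracy(ood_test, threshold)
gap = id_acc - ood_acc

print(f"in-distribution accuracy:     {id_acc:.3f}")
print(f"out-of-distribution accuracy: {ood_acc:.3f}")
print(f"gap:                          {gap:.3f}")
```

The classifier never sees the shifted domain, so its threshold is misplaced there and accuracy drops. The benchmark measures the same kind of gap on real data, where the shift is not a simple additive offset.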

Code & Models

Datasets

shlomihod/civil-comments-wilds
dataset · 313 dl

Videos

WILDS: A Benchmark of in-the-Wild Distribution Shifts · SlidesLive

Taxonomy

Topics: AI in cancer detection · Machine Learning and Data Classification · COVID-19 diagnosis using AI