DataLab: A Platform for Data Analysis and Intervention
Yang Xiao, Jinlan Fu, Weizhe Yuan, Vijay Viswanathan, Zhoumianze Liu, Yixin Liu, Graham Neubig, Pengfei Liu

TL;DR
DataLab is a comprehensive platform designed to facilitate data analysis, manipulation, and understanding for machine learning datasets, supporting researchers with tools for exploration, processing, and ecosystem overview.
Contribution
The paper introduces DataLab, a unified platform offering interactive data analysis, standardized processing interfaces, and dataset ecosystem insights, covering 1,715 datasets and 3,583 transformed versions of them.
Findings
Supports analysis of 728 datasets, drawing on 140M samples annotated by 318 feature functions.
Provides tools for bias detection and data transformation.
Includes dataset recommendation and global ecosystem analysis.
Abstract
Despite data's crucial role in machine learning, most existing tools and research tend to focus on systems built on top of existing data rather than on how to interpret and manipulate data. In this paper, we propose DataLab, a unified data-oriented platform that not only allows users to interactively analyze the characteristics of data, but also provides a standardized interface for different data processing operations. Additionally, in view of the ongoing proliferation of datasets, DataLab has features for dataset recommendation and global vision analysis that help researchers form a better view of the data ecosystem. So far, DataLab covers 1,715 datasets and 3,583 of their transformed versions (e.g., hyponym replacement), where 728 datasets support various analyses (e.g., with respect to gender bias) with the help of 140M samples annotated by 318 feature functions. DataLab is under active…
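To make the idea of samples "annotated by feature functions" concrete, here is a minimal sketch in plain Python. It is not DataLab's API: the names `tokenize`, `text_length`, `gender_term_counts`, and `annotate`, as well as the tiny pronoun lists, are hypothetical and chosen only for illustration. The sketch assumes a feature function is any callable that maps one sample to one value, which lets different operations share a single interface.

```python
import re
from typing import Callable, Dict, List

# Hypothetical types for illustration; not taken from DataLab.
Sample = Dict[str, object]
FeatureFn = Callable[[Sample], object]

# Deliberately tiny, illustrative word lists (an assumption, not a real lexicon).
MALE_TERMS = {"he", "him", "his"}
FEMALE_TERMS = {"she", "her", "hers"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def text_length(sample: Sample) -> int:
    """Feature function: number of tokens in the sample's text."""
    return len(tokenize(str(sample["text"])))


def gender_term_counts(sample: Sample) -> Dict[str, int]:
    """Feature function: counts of male and female pronouns in the text."""
    tokens = tokenize(str(sample["text"]))
    return {
        "male": sum(tok in MALE_TERMS for tok in tokens),
        "female": sum(tok in FEMALE_TERMS for tok in tokens),
    }


def annotate(samples: List[Sample], feature_fns: Dict[str, FeatureFn]) -> List[Sample]:
    """Return copies of the samples with one new field per feature function."""
    annotated = []
    for sample in samples:
        enriched = dict(sample)  # copy so the input samples are left unchanged
        for name, fn in feature_fns.items():
            enriched[name] = fn(sample)
        annotated.append(enriched)
    return annotated


samples = [
    {"text": "She said he would help her."},
    {"text": "The model converged quickly."},
]
annotated = annotate(samples, {"length": text_length, "gender": gender_term_counts})
```

Under these assumptions, a dataset-level analysis such as a rough gender-bias check reduces to aggregating the per-sample annotations, for example by summing the `male` and `female` counts across all samples.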
Taxonomy
Topics: Machine Learning and Data Classification · Data Stream Mining Techniques · Machine Learning in Healthcare
