Data Engineering for Everyone

Vijay Janapa Reddi; Greg Diamos; Pete Warden; Peter Mattson; David Kanter

arXiv:2102.11447 · cs.LG · February 24, 2021 · 5 cites

TL;DR

The paper emphasizes the importance of open-source data sets in accelerating machine learning research and discusses the potential of automatic data set generation tools to address data scarcity.

Contribution

It highlights the critical role of open data sets in ML innovation and explores the potential of automation to enhance data set creation.

Findings

1. Open data sets are widely used in top AI research.
2. Scarcity of accessible open data sets limits ML progress.
3. Automatic data set generation could mitigate data scarcity.

Abstract

Data engineering is one of the fastest-growing fields within machine learning (ML). As ML becomes more common, the appetite for data grows more ravenous. But ML requires more data than individual teams of data engineers can readily produce, which presents a severe challenge to ML deployment at scale. Much like the software-engineering revolution, where mass adoption of open-source software replaced the closed, in-house development model for infrastructure code, there is a growing need to enable rapid development and open contribution to massive machine learning data sets. This article shows that open-source data sets are the rocket fuel for research and innovation at even some of the largest AI organizations. Our analysis of nearly 2000 research publications from Facebook, Google and Microsoft over the past five years shows the widespread use and adoption of open data sets. Open data…

Taxonomy

Topics: Scientific Computing and Data Management · Research Data Management Practices · Distributed and Parallel Computing Systems

Methods: Random Convolutional Kernel Transform