A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

Benjamin S. Baumer

arXiv:1708.07073·stat.CO·May 24, 2018

TL;DR

This paper introduces a framework that combines R and SQL to make extract-transform-load (ETL) operations reproducible on medium-sized data sets, meaning data that are too big for a personal computer's memory but still fit on its hard disk.

Contribution

It presents a predictable, pipeable framework for R that uses SQL to ingest, wrangle, and share medium data reproducibly, which in turn makes the resulting analyses easier to peer review.

Findings

01. Enables reproducible ETL workflows in R using SQL.

02. Reduces complexity of handling medium data sets.

03. Improves transparency and peer review of data analysis.

Abstract

Many interesting data sets available on the Internet are of a medium size: too big to fit into a personal computer's memory, but not so large that they won't fit comfortably on its hard disk. In the coming years, data sets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality.
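To make the idea of a "pipeable" ETL workflow backed by SQL concrete, here is a minimal sketch. It is written in Python with the standard-library `sqlite3` module and is not the paper's R framework or its API. The class `ToyETL`, its method names, the file names, and the sample rows are all hypothetical and chosen only for illustration. The sketch assumes each stage writes its result to disk (or to the database) and returns the same object, so the stages can be chained in a fixed, repeatable order.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


class ToyETL:
    """Hypothetical chainable ETL object; every stage returns self."""

    def __init__(self, workdir, db_path=":memory:"):
        # Raw and cleaned files live in separate directories on disk.
        self.raw_dir = Path(workdir) / "raw"
        self.load_dir = Path(workdir) / "load"
        self.raw_dir.mkdir(parents=True, exist_ok=True)
        self.load_dir.mkdir(parents=True, exist_ok=True)
        self.con = sqlite3.connect(db_path)

    def extract(self):
        # Stand-in for a download: write a small made-up raw file.
        raw = self.raw_dir / "trips.txt"
        raw.write_text("id|distance_km\n1|2.5\n2|n/a\n3|10.0\n")
        return self

    def transform(self):
        # Clean the raw file into a CSV that is ready for loading.
        with open(self.raw_dir / "trips.txt") as src, \
                open(self.load_dir / "trips.csv", "w", newline="") as dst:
            writer = csv.writer(dst)
            for row in csv.DictReader(src, delimiter="|"):
                try:
                    writer.writerow([int(row["id"]), float(row["distance_km"])])
                except ValueError:
                    continue  # drop rows that fail to parse
        return self

    def load(self):
        # Push the cleaned rows into SQL; the primary key plus
        # INSERT OR REPLACE keeps repeated runs from duplicating rows.
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS trips "
            "(id INTEGER PRIMARY KEY, distance_km REAL)"
        )
        with open(self.load_dir / "trips.csv", newline="") as f:
            rows = list(csv.reader(f))
        self.con.executemany("INSERT OR REPLACE INTO trips VALUES (?, ?)", rows)
        self.con.commit()
        return self


def run_pipeline(workdir):
    # The whole workflow reads as one chain of stages.
    etl = ToyETL(workdir).extract().transform().load()
    # Aggregation happens inside the database, not in program memory.
    return etl.con.execute(
        "SELECT COUNT(*), SUM(distance_km) FROM trips"
    ).fetchone()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_pipeline(d))
```

Because every stage leaves its output on disk or in the database, another analyst could in principle rerun the same chain and obtain the same table, and only the small query result needs to be pulled into memory.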
