# Ambitious Data Science Can Be Painless

**Authors:** Hatef Monajemi, Riccardo Murri, Eric Jonas, Percy Liang, Victoria Stodden, David L. Donoho

arXiv: 1901.08705 · 2019-01-28

## TL;DR

This paper discusses how emerging software stacks simplify large-scale cloud data science experiments, making ambitious computational research more accessible and less burdensome for researchers.

## Contribution

The paper discusses several software stacks that abstract away the complexity of cloud computing, making massive data science experiments easier to define, execute, and manage.

## Key findings

- Cloud computing enables orders of magnitude more experimentation.
- New software stacks reduce complexity and researcher burnout.
- These stacks make ambitious data-driven research practical for a wider range of researchers.

## Abstract

Modern data science research can involve massive computational experimentation; an ambitious PhD in computational fields may do experiments consuming several million CPU hours. Traditional computing practices, in which researchers use laptops or shared campus-resident resources, are inadequate for experiments at the massive scale and varied scope that we now see in data science. On the other hand, modern cloud computing promises seemingly unlimited computational resources that can be custom configured, and seems to offer a powerful new venue for ambitious data-driven science. Exploiting the cloud fully, the amount of work that could be completed in a fixed amount of time can expand by several orders of magnitude.

As potentially powerful as cloud-based experimentation may be in the abstract, it has not yet become a standard option for researchers in many academic disciplines. The prospect of actually conducting massive computational experiments in today's cloud systems confronts the potential user with daunting challenges. Leading considerations include: (i) the seeming complexity of today's cloud computing interface, (ii) the difficulty of executing an overwhelmingly large number of jobs, and (iii) the difficulty of monitoring and combining a massive collection of separate results. Starting a massive experiment "bare-handed" seems therefore highly problematic and prone to rapid "researcher burn out".

New software stacks are emerging that render massive cloud experiments relatively painless. Such stacks simplify experimentation by systematizing experiment definition, automating distribution and management of tasks, and allowing easy harvesting of results and documentation. In this article, we discuss several painless computing stacks that abstract away the difficulties of massive experimentation, thereby allowing a proliferation of ambitious experiments for scientific discovery.
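The three steps named in the abstract (systematic experiment definition, automated task distribution, and harvesting of results) can be illustrated with a small local sketch. This is not any of the stacks the paper discusses: it uses only the Python standard library, a thread pool stands in for a cloud cluster, and every name in it (`GRID`, `define_experiment`, `run_task`, `run_all`, `harvest`) is hypothetical and chosen for illustration only.

```python
import itertools
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

# Hypothetical parameter grid; the keys and values are illustrative assumptions.
GRID = {"n": [10, 100], "noise": [0.0, 0.5], "seed": [0, 1, 2]}


def define_experiment(grid):
    """Expand a parameter grid into one task (a dict) per combination."""
    keys = sorted(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def run_task(params):
    """Stand-in for a single job: estimate the mean of noisy samples around 1.0."""
    rng = random.Random(params["seed"])
    samples = [1.0 + params["noise"] * rng.gauss(0, 1) for _ in range(params["n"])]
    return {**params, "mean": statistics.fmean(samples)}


def run_all(tasks, max_workers=4):
    """Distribute tasks over local worker threads (a placeholder for cloud workers)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, tasks))


def harvest(results, key="noise"):
    """Combine per-task results: average the estimated mean within each group."""
    groups = {}
    for row in results:
        groups.setdefault(row[key], []).append(row["mean"])
    return {k: statistics.fmean(v) for k, v in sorted(groups.items())}


tasks = define_experiment(GRID)   # 2 * 2 * 3 = 12 tasks
results = run_all(tasks)          # one result dict per task
summary = harvest(results)        # one averaged value per noise level
```

Keeping the definition, execution, and harvesting steps as separate functions means that, in principle, only `run_all` would need to change to target a different backend; how the stacks in the paper handle this is described in the full text.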

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.08705/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1901.08705/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/1901.08705/full.md

---
Source: https://tomesphere.com/paper/1901.08705