TL;DR
This paper discusses the challenges of scaling dataframe systems, proposes a unified data model, and reports on building MODIN, a scalable implementation of Python's pandas, highlighting open research directions.
Contribution
It introduces a simple data model and algebra for dataframes, and presents the development of MODIN as a scalable pandas alternative.
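To make the idea of a data model plus an algebra concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's formal definition: it assumes a dataframe can be treated as an ordered 2-D array of values with labels on both rows and columns, and the names `Frame`, `transpose`, and `select_rows` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Frame:
    """Hypothetical toy dataframe: ordered values plus labels on both axes."""
    values: List[List[Any]]   # row-major cell values
    row_labels: List[Any]     # one label per row, order is significant
    col_labels: List[Any]     # one label per column, order is significant

    def __post_init__(self) -> None:
        # Check that the labels match the shape of the value array.
        if len(self.values) != len(self.row_labels):
            raise ValueError("one row label is required per row")
        for row in self.values:
            if len(row) != len(self.col_labels):
                raise ValueError("one column label is required per column")


def transpose(f: Frame) -> Frame:
    """Swap rows and columns, carrying the labels along with the data."""
    flipped = [list(col) for col in zip(*f.values)]
    if not flipped:
        # zip(*...) yields nothing for an empty frame; keep one empty row
        # per original column so the labels still line up.
        flipped = [[] for _ in f.col_labels]
    return Frame(flipped, list(f.col_labels), list(f.row_labels))


def select_rows(f: Frame, keep: Callable[[List[Any]], bool]) -> Frame:
    """Keep the rows for which `keep` is true, preserving their order."""
    kept = [(lab, row) for lab, row in zip(f.row_labels, f.values) if keep(row)]
    return Frame(
        [row for _, row in kept],
        [lab for lab, _ in kept],
        list(f.col_labels),
    )


# Example: a 3x2 frame with mixed column types.
f = Frame([[1, "a"], [2, "b"], [3, "c"]], ["r0", "r1", "r2"], ["x", "y"])
t = transpose(f)                               # rows become columns
big = select_rows(f, lambda row: row[0] >= 2)  # order-preserving filter
```

In this sketch, rows and columns are treated symmetrically, so transposing twice returns the original frame, and row order survives filtering. Both properties are assumptions of the toy model rather than claims quoted from the summary above.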
Findings
MODIN demonstrates improved scalability over pandas
A formal data model for dataframes is proposed
Open research questions remain in dataframe semantics and performance
Abstract
Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building MODIN, a scaled-up implementation of the most widely-used and complex dataframe API today, Python's pandas. With pandas as a reference, we propose a simple data model and algebra for dataframes to ground discussion in the field. Given this foundation, we lay out an agenda of open research opportunities where the distinct features of dataframes will require extending the state of the art in many dimensions of data management. We discuss the…
