Evaluation of Dataframe Libraries for Data Preparation on a Single Machine
Angelo Mozzillo, Luca Zecchini, Luca Gagliardelli, Adeel Aslam, Sonia Bergamaschi, Giovanni Simonini

TL;DR
This paper evaluates popular Python dataframe libraries for data preparation on a single machine, comparing their performance across various datasets and scenarios to guide practitioners in selecting the most suitable tool.
Contribution
It provides a comprehensive performance comparison of Pandas, Polars, CuDF, and PySpark for data preparation tasks on a single machine, highlighting their strengths and optimal use cases.
Findings
Pandas is the best fit for small datasets, thanks to its rich API.
Polars excels when the data fits in RAM and compatibility with the Pandas API is not required.
CuDF offers the top performance when GPU acceleration is available.
Abstract
Data preparation is a trial-and-error process that typically involves countless iterations over the data to define the best pipeline of operators for a given task. With tabular data, practitioners often perform that burdensome activity on local machines by writing ad hoc scripts with libraries based on the Pandas dataframe API and testing them on samples of the entire dataset: the faster the library, the less idle time its users have. In this paper, we evaluate the most popular Python dataframe libraries in general data preparation use cases to assess how they perform on a single machine. To do so, we employ 4 real-world datasets with heterogeneous features, covering a variety of scenarios, and the TPC-H benchmark. The insights gained with this experimentation are useful to data scientists who need to choose which of the dataframe libraries best suits their data preparation task at…
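To make the kind of comparison described in the abstract concrete, here is a minimal timing sketch. It is not the paper's benchmark harness: the workload (a filter followed by a group-by mean on synthetic data), the function names `median_runtime` and `build_candidates`, and the column names `key` and `value` are all hypothetical choices made for illustration. It assumes Pandas and a recent Polars release (one that provides `group_by`) may be installed, and simply skips any library that is missing.

```python
import statistics
import time
from typing import Callable, Dict


def median_runtime(step: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time (seconds) of `step` over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def build_candidates(n_rows: int = 100_000) -> Dict[str, Callable[[], object]]:
    """Build one equivalent preparation step per installed library.

    The workload is a hypothetical stand-in, not one of the paper's pipelines.
    """
    keys = [i % 100 for i in range(n_rows)]
    values = [float(i) for i in range(n_rows)]
    candidates: Dict[str, Callable[[], object]] = {}

    try:
        import pandas as pd

        pdf = pd.DataFrame({"key": keys, "value": values})
        # Filter rows, then average `value` per `key`.
        candidates["pandas"] = lambda: (
            pdf[pdf["value"] > 10].groupby("key")["value"].mean()
        )
    except ImportError:
        pass  # Pandas not installed: skip it.

    try:
        import polars as pl

        plf = pl.DataFrame({"key": keys, "value": values})
        # Same filter and aggregation, written with the Polars expression API.
        candidates["polars"] = lambda: (
            plf.filter(pl.col("value") > 10)
            .group_by("key")
            .agg(pl.col("value").mean())
        )
    except ImportError:
        pass  # Polars not installed: skip it.

    return candidates


if __name__ == "__main__":
    for name, step in build_candidates().items():
        print(f"{name}: {median_runtime(step):.4f} s")
```

Using the median over several repetitions reduces the influence of one-off effects such as cache warm-up. Timings from a toy workload like this say little about real pipelines, which is why the paper relies on real-world datasets and TPC-H.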
Taxonomy
Topics: Computational Physics and Python Applications · Advanced Data Storage Technologies · Scientific Computing and Data Management
