PandasBench: A Benchmark for the Pandas API
Alex Broihier, Stefanos Baziotis, Daniel Kang, Charith Mendis

TL;DR
PandasBench is the first benchmark built specifically for the Pandas API. It consists of 102 real-world notebooks (3,721 cells) with non-uniformly scaled inputs, and it measures both the real-world coverage and the performance of Pandas API techniques such as Modin, Dask, Koalas, and Dias.
Contribution
The paper introduces PandasBench, a benchmark tailored to the Pandas API. It is built from real-world code that has been cleaned and adapted for benchmark use, and it adds input scaling, filling a gap left by existing benchmarks.
Findings
Modin shows up to 8% speedup on real-world code.
Dask and Koalas show minimal to no speedup.
Dias achieves speedups but rewrites code incorrectly in some cases.
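As a reading aid for the percentages above, the following is a minimal sketch of how a speedup figure is conventionally computed from two runtimes. It assumes the usual definition (baseline runtime divided by technique runtime); the paper's exact measurement methodology is not given in this summary, and the helper names `median_runtime`, `speedup`, and `percent_speedup` are hypothetical.

```python
import time
from statistics import median
from typing import Callable


def median_runtime(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_seconds: float, technique_seconds: float) -> float:
    """Conventional speedup ratio (an assumption, not the paper's stated metric).

    A value above 1 means the technique is faster than the baseline;
    a value below 1 means it is slower.
    """
    if baseline_seconds <= 0 or technique_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / technique_seconds


def percent_speedup(baseline_seconds: float, technique_seconds: float) -> float:
    """Speedup as a percentage over the baseline (a ratio of 1.08 becomes 8.0)."""
    return (speedup(baseline_seconds, technique_seconds) - 1.0) * 100.0
```

Under this definition, a cell that takes 10.8 s in pandas and 10.0 s under an alternative has a ratio of 1.08, which is an 8% speedup. A negative percentage indicates a slowdown.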
Abstract
The Pandas API has been central to the success of pandas and its alternatives. Despite its importance, there is no benchmark for it, and we argue that we cannot repurpose existing benchmarks (from other domains) for the Pandas API. In this paper, we introduce requirements that are necessary for a Pandas API benchmark, and present the first benchmark that fulfills them: PandasBench. We argue that it should evaluate the real-world coverage of a technique. Yet, real-world coverage is not sufficient for a useful benchmark, and so we also: cleaned it from irrelevant code, adapted it for benchmark usage, and introduced input scaling. We claim that uniform scaling used in other benchmarks (e.g., TPC-H) is too coarse-grained for PandasBench, and use a non-uniform scaling scheme. PandasBench is the largest Pandas API benchmark to date, with 102 notebooks and 3,721 cells. We used PandasBench…
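The contrast between uniform and non-uniform scaling can be illustrated with a small sketch. This is only one possible reading, not the paper's scheme. It assumes that inputs are scaled by row replication and that "non-uniform" means each notebook gets its own factor. The notebook names, data, and factors below are all hypothetical.

```python
import pandas as pd


def scale_input(df: pd.DataFrame, factor: int) -> pd.DataFrame:
    """Replicate the rows of df `factor` times (a simple, assumed scaling method)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return pd.concat([df] * factor, ignore_index=True)


# Hypothetical notebook inputs; real notebooks would load their own datasets.
inputs = {
    "notebook_a": pd.DataFrame({"x": [1, 2, 3]}),
    "notebook_b": pd.DataFrame({"x": [10, 20]}),
}

# Uniform scaling: one global factor applied to every notebook,
# in the spirit of a single benchmark-wide scale factor.
uniform = {name: scale_input(df, 4) for name, df in inputs.items()}

# Non-uniform scaling: a separate (hypothetical) factor per notebook.
per_notebook_factor = {"notebook_a": 2, "notebook_b": 8}
non_uniform = {
    name: scale_input(df, per_notebook_factor[name]) for name, df in inputs.items()
}

# (rows under uniform scaling, rows under non-uniform scaling) per notebook
row_counts = {name: (len(uniform[name]), len(non_uniform[name])) for name in inputs}
```

With a single global factor, a notebook that is already slow and one that finishes instantly are both scaled by the same amount. Per-notebook factors allow each input to be sized independently, which is the finer granularity the abstract argues for.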
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Testing and Debugging Techniques · Software System Performance and Reliability
