Do GPUs Really Need New Tabular File Formats?
Jigao Luo, Qi Chen, Carsten Binnig

TL;DR
This paper investigates how Parquet file configurations affect GPU-accelerated scan performance, showing that GPU-aware tuning can raise effective read bandwidth to as much as 125 GB/s without changing the format.
Contribution
The paper demonstrates that Parquet's GPU scan bottlenecks stem from configuration choices rather than from the format itself, and it proposes GPU-aware configurations to address them.
Findings
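GPU-aware configurations increase effective read bandwidth to as much as 125 GB/s
CPU-oriented default configurations underutilize GPU parallelism and turn GPU scans into a bottleneck
Tuning the configuration alone, without modifying the Parquet specification, is enough to reach this bandwidth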
Abstract
Parquet is the de facto columnar file format in modern analytical systems, yet its configuration guidelines have largely been shaped by CPU-centric execution models. As GPU-accelerated data processing becomes increasingly prevalent, Parquet files generated with CPU-oriented defaults can severely underutilize GPU parallelism, turning GPU scans into a performance bottleneck. In this work, we systematically study how Parquet configurations affect GPU scan performance. We show that Parquet's poor GPU performance is not inherent to the format itself but rather a consequence of suboptimal configuration choices. By applying GPU-aware configurations, we increase effective read bandwidth up to 125 GB/s without modifying the Parquet specification.
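To make the link between configuration and parallelism concrete, here is a rough sketch that estimates how many independent pieces (row groups, column chunks, data pages) a Parquet file breaks into under two writer configurations. Row group size and data page size are standard Parquet writer settings, but this summary does not say which settings the paper tunes or in which direction. The `ParquetLayout` class, the `parallel_units` helper, and all sizes below are hypothetical illustrations, not the authors' method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ParquetLayout:
    """Hypothetical description of a Parquet writer configuration."""
    row_group_bytes: int  # target uncompressed size of one row group
    page_bytes: int       # target uncompressed size of one data page
    num_columns: int      # number of columns in the table


def parallel_units(file_bytes: int, layout: ParquetLayout) -> dict:
    """Estimate how many independent pieces a scan could work on.

    Simplifying assumptions: all columns are the same size, and
    compression, dictionary pages, and footer metadata are ignored.
    """
    row_groups = math.ceil(file_bytes / layout.row_group_bytes)
    column_chunks = row_groups * layout.num_columns
    chunk_bytes = layout.row_group_bytes / layout.num_columns
    pages_per_chunk = max(1, math.ceil(chunk_bytes / layout.page_bytes))
    return {
        "row_groups": row_groups,
        "column_chunks": column_chunks,
        "pages": column_chunks * pages_per_chunk,
    }


GIB, MIB, KIB = 2**30, 2**20, 2**10

# Two illustrative layouts for the same 8 GiB, 16-column table.
coarse = ParquetLayout(row_group_bytes=1 * GIB, page_bytes=1 * MIB, num_columns=16)
fine = ParquetLayout(row_group_bytes=128 * MIB, page_bytes=64 * KIB, num_columns=16)

print(parallel_units(8 * GIB, coarse))
print(parallel_units(8 * GIB, fine))
```

The same data yields very different numbers of units depending on the configuration, which is the kind of effect a configuration study has to account for. The sketch only counts units; it says nothing about which layout a given GPU reader handles best.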
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management
