Data Formats in Analytical DBMSs: Performance Trade-offs and Future Directions
Chunwei Liu, Anna Pavlenko, Matteo Interlandi, Brandon Haynes

TL;DR
This paper systematically evaluates Apache Arrow, Parquet, and ORC as data formats for analytical DBMSs, highlights their trade-offs and their limitations for ML tasks, and suggests directions for designing a unified data format that improves performance.
Contribution
It provides a comprehensive comparison of popular data formats in OLAP DBMSs, identifying their strengths, weaknesses, and opportunities for co-designing a unified data representation.
Findings
Each format has trade-offs that make it more or less suitable for use as a format in an analytical DBMS.
None of the formats perform optimally for certain machine learning tasks.
Opportunities exist to develop a unified in-memory and on-disk data format.
Abstract
This paper evaluates the suitability of Apache Arrow, Parquet, and ORC as formats for subsumption in an analytical DBMS. We systematically identify and explore the high-level features that are important to support efficient querying in modern OLAP DBMSs and evaluate the ability of each format to support these features. We find that each format has trade-offs that make it more or less suitable for use as a format in a DBMS and identify opportunities to more holistically co-design a unified in-memory and on-disk data representation. Notably, for certain popular machine learning tasks, none of these formats perform optimally, highlighting significant opportunities for advancing format design. Our hope is that this study can be used as a guide for system developers designing and using these formats, as well as provide the community with directions to pursue for improving these common open…
Taxonomy
Topics: Data Stream Mining Techniques · Cloud Computing and Resource Management · Advanced Database Systems and Queries
