Parallel Writing of Nested Data in Columnar Formats
Jonas Hahnfeld, Jakob Blomer, Thorsten Kollegger

TL;DR
This paper presents a scalable multithreaded method for writing nested, variable-sized data in columnar formats. It improves write parallelism for high-energy physics data storage while staying compatible with the existing on-disk format.
Contribution
It introduces a parallel writing approach for nested columnar data in ROOT's RNTuple format that removes the single-writer bottleneck and lets multiple threads write into one file.
Findings
Achieves near-perfect scalability limited only by storage bandwidth.
Stays compatible with the existing ROOT RNTuple format.
Demonstrates benefits in real-world dataset skimming.
Abstract
High Energy Physics (HEP) experiments, for example at the Large Hadron Collider (LHC) at CERN, store data at exabyte scale in sets of files. They use a binary columnar data format provided by the ROOT framework, which also transparently compresses the data. In this format, cells are not necessarily atomic; they may contain nested collections of variable size. Because row and block sizes are not known upfront, efficient parallel writing is challenging to implement. In particular, the data cannot be organized in a regular grid where indices and offsets can be precomputed for independent writing. In this paper, we propose a scalable approach to efficient multithreaded writing of nested data in a columnar format into a single file. Our approach removes the bottleneck of a single writer while staying fully compatible with the compressed, columnar, variably row-sized data…
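To make the offset problem concrete, here is a minimal Python sketch of one generic way to let several threads write compressed, variable-size blocks into a single file. It is an illustration under stated assumptions, not the RNTuple implementation or the authors' algorithm: the byte layout, the `SharedFileWriter` class, and the helper names are all hypothetical, and `os.pwrite` assumes a POSIX system. Each thread encodes and compresses its block independently; only once the compressed size is known does it reserve a byte range in a short critical section, and the actual I/O happens outside the lock.

```python
import os
import struct
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


def encode_block(rows):
    """Encode variable-size rows as an offsets column plus a flat values column."""
    offsets, values = [], []
    for row in rows:
        values.extend(row)
        offsets.append(len(values))  # end index of this row's collection
    raw = struct.pack(f"<I{len(offsets)}q{len(values)}d",
                      len(offsets), *offsets, *values)
    return zlib.compress(raw)  # compressed size is unknown until this returns


def decode_block(blob):
    """Inverse of encode_block: rebuild the nested rows."""
    raw = zlib.decompress(blob)
    (n,) = struct.unpack_from("<I", raw, 0)
    offsets = struct.unpack_from(f"<{n}q", raw, 4)
    n_values = offsets[-1] if n else 0
    values = struct.unpack_from(f"<{n_values}d", raw, 4 + 8 * n)
    starts = (0,) + offsets[:-1]
    return [list(values[s:e]) for s, e in zip(starts, offsets)]


class SharedFileWriter:
    """Hypothetical helper: hands out byte ranges in one shared file."""

    def __init__(self, fd):
        self._fd = fd
        self._end = 0
        self._lock = threading.Lock()
        self.index = []  # (block_id, offset, size), in commit order

    def append(self, block_id, blob):
        with self._lock:  # short critical section: bookkeeping only
            offset = self._end
            self._end += len(blob)
            self.index.append((block_id, offset, len(blob)))
        view = memoryview(blob)
        written = 0
        while written < len(view):  # positional write, outside the lock
            written += os.pwrite(self._fd, view[written:], offset + written)


def write_parallel(blocks, path, workers=4):
    """Compress and write all blocks concurrently; return the block index."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        writer = SharedFileWriter(fd)

        def task(item):
            block_id, rows = item
            writer.append(block_id, encode_block(rows))

        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(task, enumerate(blocks)))  # re-raises worker errors
        return writer.index
    finally:
        os.close(fd)


def read_all(path, index):
    """Read the blocks back in logical (block_id) order using the index."""
    result = []
    with open(path, "rb") as f:
        for _, offset, size in sorted(index):
            f.seek(offset)
            result.append(decode_block(f.read(size)))
    return result
```

Calling `write_parallel(blocks, path)` with a list of blocks, each a list of rows holding floats, returns an index that `read_all(path, index)` uses to restore the original order. Blocks may land in the file in any order, so the index rather than the physical position defines the logical sequence.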
