Parallel Data Object Creation: Towards Scalable Metadata Management in High-Performance I/O Library
Youjia Li, Robert Latham, Robert Ross, Ankit Agrawal, Alok Choudhary, Wei-Keng Liao

TL;DR
This paper introduces a scalable method for parallel data object creation in high-performance I/O libraries, significantly reducing creation time and memory footprint for large-scale scientific data management.
Contribution
It proposes a novel file header format enabling independent data object creation, improving scalability and efficiency over traditional collective methods.
Findings
Achieved up to 582x faster data object creation on 4096 processes.
Reduced per-process memory footprint, which scales inversely with the number of processes.
Demonstrated scalability for millions of data objects in high-performance environments.
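The memory finding above can be illustrated with a back-of-the-envelope sketch. It is not the paper's implementation. It assumes a fixed metadata size per data object (a hypothetical 256 bytes), an even split of objects among processes, and that collective creation leaves every process holding metadata for all objects. The function name and parameters are invented for illustration.

```python
def metadata_bytes_per_process(num_objects, num_procs,
                               bytes_per_object=256, collective=True):
    """Estimate the metadata bytes held by one process.

    Assumptions (not from the paper): fixed bytes_per_object, and
    objects evenly divided among processes in the independent case.
    """
    if collective:
        # Every process keeps consistent metadata for all objects.
        return num_objects * bytes_per_object
    # Each process keeps only its own share (ceiling division).
    share = -(-num_objects // num_procs)
    return share * bytes_per_object


if __name__ == "__main__":
    n = 1_000_000  # hypothetical object count
    for p in (64, 512, 4096):
        coll = metadata_bytes_per_process(n, p, collective=True)
        indep = metadata_bytes_per_process(n, p, collective=False)
        print(f"P={p:5d}  collective={coll:>12,d} B  independent={indep:>10,d} B")
```

Under these assumptions the collective estimate is constant in the number of processes, while the independent estimate shrinks roughly as 1/P.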
Abstract
High-level I/O libraries, such as HDF5 and PnetCDF, are commonly used by large-scale scientific applications to perform I/O tasks in parallel. These I/O libraries store the metadata such as data types and dimensionality along with the raw data in the same files. While these libraries are well-optimized for concurrent access to the raw data, they are designed neither to handle a large number of data objects efficiently nor to create different data objects independently by multiple processes, as they require applications to call data object creation APIs collectively with consistent metadata among all processes. Applications that process data gathered from remote sensors, such as particle collision experiments in high-energy physics, may generate data of different sizes from different sensors and desire to store them as separate data objects. For such applications, the I/O library's…
Taxonomy
Topics: Advanced Data Storage Technologies · Scientific Computing and Data Management · Advanced Database Systems and Queries
