A Case Study on Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis
Sunwoo Lee, Kai-yuan Hou, Kewei Wang, Saba Sehrish, Marc Paterno, James Kowalkowski, Quincey Koziol, Robert Ross, Ankit Agrawal, Alok Choudhary, Wei-keng Liao

TL;DR
This paper presents a case study on optimizing parallel HDF5 data aggregation for high energy physics, focusing on reducing time, improving compression, and enabling efficient large-scale data analysis.
Contribution
It explores parallel I/O strategies and HDF5 features to enhance data aggregation and access efficiency in large-scale high energy physics datasets.
Findings
Parallel I/O strategies reduce aggregation time
Effective compression improves storage efficiency
Optimized data access enables scalable analysis
Abstract
In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, help us better understand the fundamental particles and their interactions. This data is often captured in many files of small size, creating a data management challenge for scientists. In order to better facilitate data management, transfer, and analysis on large scale platforms, it is advantageous to aggregate data further into a smaller number of larger files. However, this translation process can consume significant time and resources, and if performed incorrectly the resulting aggregated files can be inefficient for highly parallel access during analysis on large scale platforms. In this paper, we present our case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting…
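To make the aggregation problem concrete, the sketch below shows one generic way to plan a parallel concatenation: each small input file is given a disjoint row range in the single output dataset, so that workers can write without overlapping. This is an illustration only and is not taken from the paper. The function name `plan_concatenation`, the round-robin assignment of files to workers, and the example row counts are all assumptions, and no HDF5 or MPI calls are made.

```python
from itertools import accumulate


def plan_concatenation(rows_per_file, num_workers):
    """Return the output size and, per worker, the (file, start, stop) ranges to write.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    # Exclusive prefix sum: offsets[i] is the first output row of input file i.
    offsets = [0] + list(accumulate(rows_per_file))[:-1]
    total_rows = sum(rows_per_file)

    # Round-robin file assignment (an assumed, deliberately simple policy).
    plan = {worker: [] for worker in range(num_workers)}
    for index, (start, count) in enumerate(zip(offsets, rows_per_file)):
        plan[index % num_workers].append((index, start, start + count))
    return total_rows, plan


# Four small input files with made-up row counts, shared by two workers.
total_rows, plan = plan_concatenation([3, 5, 2, 4], num_workers=2)
for worker, ranges in plan.items():
    print(worker, ranges)
```

Because the row ranges never overlap, each worker could in principle write its slices into a pre-sized output dataset independently. How that interacts with chunking and compression in a real HDF5 file is outside the scope of this sketch.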
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Big Data Technologies and Applications
