Fast Columnar Physics Analyses of Terabyte-Scale LHC Data on a Cache-Aware Dask Cluster
Niclas Eich, Martin Erdmann, Peter Fackeldey, Benjamin Fischer, Dennis Noll, Yannik Rath

TL;DR
This paper introduces a cache-aware, Dask-based system that accelerates terabyte-scale LHC data analysis by combining vectorized processing, MapReduce scaling, and SSD caching. On a Higgs pair production example, it reports a 6.3x runtime improvement after one analysis cycle and a 14.9x overall speedup after 10 cycles.
Contribution
The paper integrates vectorized event processing, the MapReduce paradigm, and SSD caching to analyze large-scale LHC data efficiently on small institute clusters.
Findings
6.3x runtime improvement after one cycle
14.9x overall speedup after 10 cycles
Effective use of SSD caching reduces IO latency
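The caching finding can be illustrated with a read-through cache: the first access copies a file from slow storage to a fast local directory, and later analysis cycles read the local copy. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation. The class `ReadThroughCache`, the function `demo`, and the directory names are hypothetical, and a temporary directory stands in for both the remote storage and the SSD.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class ReadThroughCache:
    """Hypothetical helper: copy a file into a local cache on first access."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _cache_path(self, source):
        # Key the cached copy by a hash of the resolved source path.
        digest = hashlib.sha256(str(Path(source).resolve()).encode()).hexdigest()
        return self.cache_dir / digest

    def open(self, source):
        cached = self._cache_path(source)
        if cached.exists():
            self.hits += 1  # fast path: file is already on local storage
        else:
            self.misses += 1  # slow path: fetch from the (simulated) remote
            shutil.copyfile(source, cached)
        return cached.open("rb")


def demo():
    """Read the same file in three 'analysis cycles' and count cache hits."""
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp) / "remote"  # stand-in for slow network storage
        remote.mkdir()
        source = remote / "events.bin"
        source.write_bytes(b"\x00" * 1024)

        cache = ReadThroughCache(Path(tmp) / "ssd_cache")  # stand-in for the SSD
        for _ in range(3):
            with cache.open(source) as handle:
                handle.read()
        return cache.hits, cache.misses
```

Only the first cycle pays the cost of the remote read, which is consistent with the reported speedup growing as more cycles are run.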
Abstract
The development of an LHC physics analysis involves numerous investigations that require the repeated processing of terabytes of data. Thus, a rapid completion of each of these analysis cycles is central to mastering the science project. We present a solution to efficiently handle and accelerate physics analyses on small-size institute clusters. Our solution is based on three key concepts: vectorized processing of collision events, the "MapReduce" paradigm for scaling out on computing clusters, and efficiently utilized SSD caching to reduce latencies in IO operations. Using simulations from a Higgs pair production physics analysis as an example, we achieve an improvement factor of 6.3 in runtime after one cycle and even an overall speedup of a factor of 14.9 after 10 cycles.
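The "MapReduce" concept named in the abstract can be sketched as follows: the data are split into chunks, each chunk is mapped to a partial histogram independently, and the partial histograms are reduced by summation. This is a minimal sketch using only the Python standard library, not the authors' code. A `ThreadPoolExecutor` stands in for the Dask cluster, the map step is written in plain Python where a real analysis would use vectorized array operations, and all function names and the toy values are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def make_chunks(values, chunk_size):
    """Split the input into fixed-size chunks, one per map task."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def map_chunk(chunk, bin_width=10):
    """Map step: fill a partial histogram keyed by the lower bin edge."""
    return Counter((value // bin_width) * bin_width for value in chunk)


def reduce_histograms(left, right):
    """Reduce step: merge two partial histograms by adding bin counts."""
    return left + right


def run_analysis(values, chunk_size=4, workers=2):
    """Run the map step on a worker pool, then reduce to one histogram."""
    chunks = make_chunks(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_histograms, partials, Counter())


# Toy integer "observable" values, invented for illustration only.
toy_values = [12, 15, 27, 31, 38, 9, 14, 25, 33, 101]
histogram = run_analysis(toy_values)
```

Because histogram addition is associative and commutative, the merged result does not depend on how the data are chunked or in which order the workers finish, which is what allows the map step to scale out across a cluster.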
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Caching and Content Delivery
