CMS Analysis and Data Reduction with Apache Spark
Oliver Gutsche (2), Luca Canali (1), Illia Cremer (4), Matteo Cremonesi (2), Peter Elmer (5), Ian Fisk (3), Maria Girone (1), Bo Jayatilaka (2), Jim Kowalkowski (2), Viktor Khristenko (1), Evangelos Motesnitsalis (1), Jim Pivarski (5), Saba Sehrish (2), Kacper Surdy (1)

TL;DR
This paper explores using Apache Spark for large-scale data reduction and analysis in high-energy physics, demonstrating its potential to improve efficiency and interactivity in processing petabyte-scale datasets.
Contribution
It presents a novel application of Apache Spark in CMS data analysis workflows, including data reduction and physics searches, with initial results on scalability and performance.
Findings
A Spark-based reduction facility targeting the reduction of 1 PB of CMS data to 1 TB
Comparable performance of Spark-based analysis to ROOT in physics searches
Progress in scaling Spark for high-energy physics data processing
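The reduction pattern behind these findings (filter events, then keep a small set of derived columns per event) can be illustrated with a minimal sketch. This is plain Python rather than Spark, and it is not the authors' code: the event fields (`met`, `jet_pt`), the thresholds, and the function name `reduce_events` are all hypothetical and chosen only to show the shape of the computation.

```python
from typing import Iterable, Iterator

# Hypothetical event records; field names are invented for illustration.
EVENTS = [
    {"run": 1, "event": 101, "met": 250.0, "jet_pt": [310.0, 45.0, 22.0]},
    {"run": 1, "event": 102, "met": 35.0, "jet_pt": [60.0, 41.0]},
    {"run": 2, "event": 201, "met": 410.0, "jet_pt": [520.0]},
    {"run": 2, "event": 202, "met": 180.0, "jet_pt": []},
]

MET_CUT = 200.0     # assumed event selection threshold (GeV)
JET_PT_CUT = 30.0   # assumed jet selection threshold (GeV)


def reduce_events(events: Iterable[dict]) -> Iterator[dict]:
    """Filter events, then emit one flat, ntuple-like row per surviving event."""
    for ev in events:
        # Filter: drop events that fail the selection.
        if ev["met"] <= MET_CUT:
            continue
        # Transform: replace the variable-length jet list with summary columns.
        jets = [pt for pt in ev["jet_pt"] if pt > JET_PT_CUT]
        yield {
            "run": ev["run"],
            "event": ev["event"],
            "met": ev["met"],
            "n_jets": len(jets),
            "lead_jet_pt": max(jets, default=0.0),
        }


ntuple = list(reduce_events(EVENTS))
for row in ntuple:
    print(row)
```

In a Spark setting, the same two steps would plausibly map onto a DataFrame `filter` followed by a `select` of derived columns, with the work distributed across a cluster. That mapping is an assumption about how such a reduction could be expressed, not a description of the paper's implementation.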
Abstract
Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies, have emerged from industry and open source projects to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and tools, promising a fresh look at analysis of very large datasets that could potentially reduce the time-to-physics with increased interactivity. Moreover, these new tools are typically actively developed by large communities, often profiting from industry resources, and…
