Building a scalable python distribution for HEP data analysis
David Lange

TL;DR
This paper presents the development of a scalable Python distribution tailored for high-energy physics data analysis, integrating community tools to enhance usability, stability, and reproducibility for both large distributions and individual users.
Contribution
It introduces a Python distribution optimized for HEP analysis that combines community and HEP-specific tools, addressing integration, testing, and sustainability challenges.
Findings
Successful integration of HEP and data science Python packages
Enhanced stability and usability for large-scale distributions
Progress towards a sustainable Python infrastructure for HEP
Abstract
There are numerous approaches to building analysis applications across the high-energy physics community. Among them are Python-based, or at least Python-driven, analysis workflows. We aim to ease the adoption of a Python-based analysis toolkit by making it easier for non-expert users to gain access to Python tools for scientific analysis. Experimental software distributions and individual user analysis have quite different requirements. Distributions tend to worry most about stability, usability, and reproducibility, while users usually strive to be fast and nimble. We discuss how we built and now maintain a Python distribution for analysis while satisfying the requirements of both a large software distribution (in our case, that of CMSSW) and user-level, or laptop-level, analysis. We pursued the integration of tools used by the broader data science community as well as HEP-developed (e.g.,…
