Toward real-time data query systems in HEP
Jim Pivarski, David Lange, Thanat Jatuphattharachat

TL;DR
This paper reports progress toward real-time data query systems for High Energy Physics (HEP). It focuses on optimizing data access and calculations for "query sized" payloads, such as a single histogram, so that large datasets can be analyzed interactively. The techniques include direct extraction of ROOT TBranches into Numpy arrays and compilation of Python analysis functions.
Contribution
It introduces techniques for fast data extraction and analysis in HEP, including direct ROOT TBranch extraction and Python function compilation, tailored for interactive querying.
Findings
Efficient extraction of ROOT TBranches into Numpy arrays.
Compilation of Python analysis functions for rapid execution.
Strategies for caching and preloading data in distributed environments.
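To make the columnar idea concrete, here is a minimal sketch of a "query sized" calculation over flat arrays. It is not the authors' implementation: the array names, the sample values, the offsets-based layout for per-event lists, and the helper `leading_value_per_event` are all assumptions made for illustration. The data are built in memory rather than read from a ROOT file.

```python
import numpy as np

# Hypothetical columnar layout for a variable-length quantity (e.g. one
# value per particle): a flat content array plus an offsets array that
# marks where each event starts and stops. Values are made up.
content = np.array([31.2, 12.5, 48.0, 7.9, 22.1, 65.3], dtype=np.float64)
offsets = np.array([0, 2, 2, 5, 6], dtype=np.int64)  # 4 events: 2, 0, 3, 1 entries


def leading_value_per_event(content, offsets, fill=np.nan):
    """Return the maximum value in each event, or `fill` for empty events."""
    n_events = len(offsets) - 1
    out = np.full(n_events, fill, dtype=content.dtype)
    for i in range(n_events):
        start, stop = offsets[i], offsets[i + 1]
        if stop > start:  # skip events with no entries
            out[i] = content[start:stop].max()
    return out


leading = leading_value_per_event(content, offsets)

# Histogram only the events that have at least one entry.
has_entry = ~np.isnan(leading)
counts, edges = np.histogram(leading[has_entry], bins=4, range=(0.0, 80.0))
```

The explicit loop over events is written in plain Python on purpose. It touches only arrays and scalars, which is the shape of function that a Python compiler could plausibly turn into fast machine code, and the query result (one small histogram) is tiny compared with the input columns.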
Abstract
Exploratory data analysis tools must respond quickly to a user's questions, so that the answer to one question (e.g. a visualized histogram or fit) can influence the next. In some SQL-based query systems used in industry, even very large (petabyte) datasets can be summarized on a human timescale (seconds), employing techniques such as columnar data representation, caching, indexing, and code generation/JIT-compilation. This article describes progress toward realizing such a system for High Energy Physics (HEP), focusing on the intermediate problems of optimizing data access and calculations for "query sized" payloads, such as a single histogram or group of histograms, rather than large reconstruction or data-skimming jobs. These techniques include direct extraction of ROOT TBranches into Numpy arrays and compilation of Python analysis functions (rather than SQL) to be executed very…
