Predicting dataset popularity for the CMS experiment
Valentin Kuznetsov, Ting Li, Luca Giommi, Daniele Bonacorsi, Tony Wildish

TL;DR
This paper presents an analysis of dataset popularity in the CMS experiment at CERN, aiming to improve data management and system throughput through predictive modeling based on metadata.
Contribution
It introduces a novel data-driven approach to predict dataset popularity, aiding dynamic data placement and infrastructure management in high-energy physics computing.
Findings
Identified patterns in dataset usage and popularity
Proposed a model for predicting dataset demand
Laid the groundwork for optimizing data storage and retrieval
Abstract
The CMS experiment at the LHC accelerator at CERN relies on its computing infrastructure to stay at the frontier of High Energy Physics, searching for new phenomena and making discoveries. Even though computing plays a significant role in physics analysis, we rarely use its data to predict the behavior of the system itself. Basic information about computing resources, user activities, and site utilization can be very useful for improving the throughput of the system and its management. In this paper, we discuss a first CMS analysis of dataset popularity based on CMS meta-data, which can be used as a model for dynamic data placement and provides the foundation of a data-driven approach for the CMS computing infrastructure.
