Predicting dataset popularity for the CMS experiment

Valentin Kuznetsov; Ting Li; Luca Giommi; Daniele Bonacorsi; Tony Wildish

arXiv:1602.07226 · physics.data-an · December 21, 2016

TL;DR

This paper presents an analysis of dataset popularity in the CMS experiment at CERN, aiming to improve data management and system throughput through predictive modeling based on metadata.

Contribution

It introduces a novel data-driven approach to predict dataset popularity, aiding dynamic data placement and infrastructure management in high-energy physics computing.

Findings

01. Identified patterns in dataset usage and popularity

02. Proposed a model for predicting dataset demand

03. Laid groundwork for optimizing data storage and retrieval

Abstract

The CMS experiment at the LHC accelerator at CERN relies on its computing infrastructure to stay at the frontier of High Energy Physics, searching for new phenomena and making discoveries. Even though computing plays a significant role in physics analysis, we rarely use its data to predict the behavior of the system itself. Basic information about computing resources, user activities, and site utilization can be very useful for improving the throughput of the system and its management. In this paper, we discuss a first CMS analysis of dataset popularity based on CMS metadata, which can be used as a model for dynamic data placement and can provide the foundation of a data-driven approach for the CMS computing infrastructure.
