Studying Scientific Data Lifecycle in On-demand Distributed Storage Caches
Julian Bellavita, Alex Sim, Kesheng Wu, Inder Monga, Chin Guok, Frank Würthwein, Diego Davila

TL;DR
This study analyzes data access patterns in the XRootD distributed storage cache used in high-energy physics and uses them to examine how cache size affects hit rate.
Contribution
It provides a detailed analysis of cache access patterns and introduces a cache simulator to evaluate the impact of cache size on hit rates.
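To make the idea of a cache simulator concrete, here is a minimal sketch that replays a trace of file accesses through a size-bounded cache and reports the hit rate. It is not the authors' simulator: the least-recently-used (LRU) eviction policy, the function name `simulate_lru_hit_rate`, the `(file_id, size_bytes)` trace format, and the toy trace are all assumptions made for illustration.

```python
from collections import OrderedDict


def simulate_lru_hit_rate(accesses, capacity_bytes):
    """Replay (file_id, size_bytes) accesses through an LRU cache.

    Returns the fraction of accesses served from the cache.
    LRU eviction is an assumption; the paper's policy may differ.
    """
    cache = OrderedDict()  # file_id -> size_bytes, least recently used first
    used = 0
    hits = 0
    total = 0
    for file_id, size in accesses:
        total += 1
        if file_id in cache:
            hits += 1
            cache.move_to_end(file_id)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # file can never fit; counted as a miss, not cached
        # Evict least recently used files until the new file fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[file_id] = size
        used += size
    return hits / total if total else 0.0


# Hypothetical toy trace: three files of 2 bytes each.
TOY_TRACE = [("a", 2), ("b", 2), ("a", 2), ("c", 2), ("b", 2), ("a", 2)]

for capacity in (4, 6):
    rate = simulate_lru_hit_rate(TOY_TRACE, capacity)
    print(f"capacity={capacity} hit_rate={rate:.3f}")
```

Sweeping `capacity_bytes` over a range of candidate sizes and plotting the resulting hit rates is one way to look for the size at which further growth stops paying off.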
Findings
Increasing cache size from 40 TB to 56 TB significantly improves hit rate.
The average size of a read operation grows over time, while the number of read operations per file remains nearly constant.
Files tend to have consistent open durations, aiding cache modeling.
Abstract
The XRootD system is used to transfer, store, and cache large datasets from high-energy physics (HEP). In this study we focus on its capability as a distributed on-demand storage cache. By exploring a large set of daily log files from 2020 and 2021, we seek to understand the data access patterns that might inform future cache design. Our study begins with a set of summary statistics regarding file read operations, file lifetimes, and file transfers. We observe that the number of read operations on each file remains nearly constant, while the average size of a read operation grows over time. Furthermore, files tend to have a consistent length of time during which they remain open and are in use. Based on this comprehensive study of the cache access statistics, we developed a cache simulator to explore the behavior of caches of different sizes. Within a certain size range, we find…
