# A distributed data warehouse system for astroparticle physics

**Authors:** Minh-Duc Nguyen (1), Alexander Kryukov (1), Julia Dubenskaya (1), Elena Korosteleva (1), Stanislav Polyakov (1), Evgeny Postnikov (1), Igor Bychkov (2), Andrey Mikhailov (2), Alexey Shigarov (2), Oleg Fedorov (3), Yulia Kazarina (3), Dmitry Shipilov (3), Dmitry Zhurov (3) ((1) Lomonosov Moscow State University, Skobeltsyn Institute of Nuclear Physics, (2) Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of Russian Academy of Sciences, (3) Applied Physics Institute, Irkutsk State University)

arXiv: 1812.01906 · 2018-12-10

## TL;DR

This paper presents a distributed data warehouse system tailored for astroparticle physics experiments, enabling efficient on-demand data access and integration across multiple large-scale data sources.

## Contribution

The paper describes an implementation built on CernVM-FS, extended with custom components that search across all available data sets and deliver selected subsets to users' computers.

## Key findings

- Scientists can retrieve just the data they need from multiple experiments over the Internet, on demand.
- Collected data is accessible through a user-friendly interface, with proper permissions, both on-site and online.
- Data from different experiments can be combined for joint analysis.

## Abstract

A distributed data warehouse system is a topical issue in the field of astroparticle physics. Well-known experiments such as TAIGA and KASCADE-Grande produce tens of terabytes of data measured by their instruments. It is critical to have a smart on-site data warehouse system that stores the collected data so that it can be distributed effectively. It is also vital to provide scientists with a user-friendly interface to access the collected data with proper permissions, not only on-site but also online. Online access is especially useful when scientists need to combine data from different experiments for analysis. In this work, we describe an approach to implementing a distributed data warehouse system that allows scientists to acquire just the necessary data from different experiments via the Internet on demand. The implementation is based on CernVM-FS with additional components developed by us to search through all available data sets and deliver their subsets to users' computers.
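
The search-and-subset idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `FileRecord` fields, the `search` and `to_mount_paths` helpers, the repository name `astro.example.org`, the placeholder experiments `exp_a` and `exp_b`, and all catalog entries are assumptions made up for illustration. The sketch only shows how a metadata query might narrow a multi-experiment catalog to the few files a user needs, which could then be read through a CernVM-FS mount that fetches file contents on access.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileRecord:
    """Metadata for one data file (all fields are illustrative assumptions)."""
    experiment: str   # placeholder experiment label, e.g. "exp_a"
    path: str         # path relative to the experiment's storage root
    observed: date    # date the data were recorded
    size_bytes: int


def search(catalog, experiments, start, end):
    """Return records from the given experiments observed within [start, end]."""
    wanted = set(experiments)
    return [
        rec for rec in catalog
        if rec.experiment in wanted and start <= rec.observed <= end
    ]


def to_mount_paths(records, mount_root="/cvmfs/astro.example.org"):
    """Map selected records to paths under a hypothetical read-only mount."""
    return [f"{mount_root}/{rec.experiment}/{rec.path}" for rec in records]


# Invented sample catalog; real metadata would come from the experiments.
CATALOG = [
    FileRecord("exp_a", "2017/run001.dat", date(2017, 11, 3), 2_000_000_000),
    FileRecord("exp_a", "2018/run042.dat", date(2018, 1, 15), 3_000_000_000),
    FileRecord("exp_b", "2012/evt_0007.root", date(2012, 5, 20), 1_500_000_000),
    FileRecord("exp_b", "2018/evt_0100.root", date(2018, 1, 20), 1_000_000_000),
]

# Select only January 2018 data from both placeholder experiments.
subset = search(CATALOG, ["exp_a", "exp_b"], date(2018, 1, 1), date(2018, 1, 31))
paths = to_mount_paths(subset)
subset_bytes = sum(rec.size_bytes for rec in subset)
total_bytes = sum(rec.size_bytes for rec in CATALOG)

for p in paths:
    print(p)
print(f"selected {subset_bytes} of {total_bytes} bytes")
```

In this toy catalog the query selects two of the four files, so only those files would need to be transferred to the user's machine rather than the whole data set.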

---
Source: https://tomesphere.com/paper/1812.01906