A Modular and Fault-Tolerant Data Transport Framework
Timm M. Steinbeck

TL;DR
This paper presents a modular, flexible, and fault-tolerant data transport framework designed for high-throughput, large-scale cluster systems in particle physics experiments, with promising performance results.
Contribution
It introduces a novel, component-based data transport framework with runtime configurability and fault tolerance for large-scale scientific data acquisition systems.
Findings
Framework achieves high data throughput meeting ALICE requirements
Component-based design allows flexible configuration and extension
Fault-tolerance mechanisms enable reliable operation in large clusters
Abstract
The High Level Trigger (HLT) of the future ALICE heavy-ion experiment has to reduce its input data rate of up to 25 GB/s to at most 1.25 GB/s for output before the data is written to permanent storage. To cope with these data rates, a large PC cluster system is being designed to scale to several thousand nodes connected by a fast network. For the software that will run on these nodes, a flexible data transport and distribution software framework, described in this thesis, has been developed. The framework consists of a set of separate components that can be connected via a common interface. This makes it possible to construct different configurations for the HLT that can even be changed at runtime. To ensure fault-tolerant operation of the HLT, the framework includes a basic fail-over mechanism that allows whole nodes to be replaced after a failure. The mechanism will be further expanded in the future,…
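Illustrative sketch
The abstract describes components joined through a common interface, reconfigurable at runtime, with a basic fail-over that replaces a failed node. The Python sketch below is one possible reading of that idea, not the thesis's implementation: the names `Stage`, `Worker`, `FailoverStage`, and `build_chain`, and the halving "reduction" step, are all hypothetical and exist only for illustration.

```python
from typing import Callable, List, Optional


class Stage:
    """Hypothetical common interface: handle an event locally, then forward it."""

    def __init__(self) -> None:
        self.downstream: Optional["Stage"] = None

    def connect(self, downstream: "Stage") -> "Stage":
        # Can be called again later to rewire the chain at runtime.
        self.downstream = downstream
        return downstream  # returned so connections can be chained

    def handle(self, event: bytes) -> bytes:
        raise NotImplementedError

    def process(self, event: bytes) -> bytes:
        result = self.handle(event)
        if self.downstream is None:
            return result
        return self.downstream.process(result)


class Worker(Stage):
    """A component standing in for work done on one (assumed) cluster node."""

    def __init__(self, name: str, transform: Callable[[bytes], bytes]) -> None:
        super().__init__()
        self.name = name
        self.transform = transform
        self.alive = True  # toggled by hand here to simulate a node failure

    def handle(self, event: bytes) -> bytes:
        if not self.alive:
            raise RuntimeError(f"node {self.name} is down")
        return self.transform(event)


class FailoverStage(Stage):
    """Swaps in a spare worker when the active one fails (assumed policy)."""

    def __init__(self, primary: Worker, spares: List[Worker]) -> None:
        super().__init__()
        self.active = primary
        self.spares = list(spares)

    def handle(self, event: bytes) -> bytes:
        while True:
            try:
                return self.active.handle(event)
            except RuntimeError:
                if not self.spares:
                    raise  # no replacement node left
                self.active = self.spares.pop(0)


def build_chain():
    # Toy "data reduction": keep the first half of each event.
    halve = lambda e: e[: len(e) // 2]
    source = Worker("source", lambda e: e)
    primary = Worker("node-a", halve)
    reducer = FailoverStage(primary, [Worker("node-b", halve)])
    sink = Worker("sink", lambda e: e)
    source.connect(reducer).connect(sink)
    return source, primary, reducer
```

In this sketch, calling `source.process(event)` pushes an event through the chain. Setting `primary.alive = False` makes the next call switch transparently to the spare. A real system would also need failure detection across the network and recovery of in-flight data, which are outside this toy example.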
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
