Towards automation of computing fabrics using tools from the fabric management workpackage of the EU DataGrid project
Olof Barring

TL;DR
This paper describes the development of an integrated, open-source framework for automating the management of large computing fabrics within the EU DataGrid project, including configuration, monitoring, installation, and fault tolerance subsystems.
Contribution
It introduces a comprehensive architecture and implementation of fabric management tools that automate configuration, installation, and fault tolerance for large-scale computing environments.
Findings
Subsystems for configuration management and monitoring have been delivered.
An installation and service configuration subsystem based on standard tools is being developed.
The integrated system supports centralized management of large computer farms.
Abstract
Workpackage 4 of the EU DataGrid project aims to provide the tools needed to automate the management of medium-sized to very large computing fabrics. At the end of the second project year, subsystems for centralized configuration management (presented at LISA'02) and for performance/exception monitoring have been delivered. These will soon be augmented with a subsystem for node installation and service configuration, which builds on widely used existing standards where available (e.g. rpm, kickstart, init.d scripts) and on clean interfaces to OS-dependent components (e.g. base installation and service management). Together, the three subsystems allow centralized management of very large computer farms. Finally, a fault tolerance system is being developed to tie the above subsystems together into a complete framework for automated enterprise computing management by…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques
