From Thread to Transcontinental Computer: Disturbing Lessons in Distributed Supercomputing
Derek Groen (University College London), Simon Portegies Zwart (Sterrewacht Leiden)

TL;DR
This paper discusses the technical and political challenges of using a distributed intercontinental supercomputing setup for cosmological simulations, demonstrating high efficiency and advocating for flexible policies to enhance HPC capabilities.
Contribution
It presents a successful implementation of a distributed supercomputing system across continents, achieving up to 93% efficiency, and highlights the importance of flexible user policies for effective grid computing.
Findings
Achieved 93% efficiency in intercontinental supercomputing
Flexible user policies significantly improve grid computing effectiveness
Distributed supercomputing can be scaled using smaller, flexible clusters
Abstract
We describe the political and technical complications encountered during the astronomical CosmoGrid project. CosmoGrid is a numerical study of the formation of large-scale structure in the universe. The simulations are challenging due to the enormous dynamic range in spatial and temporal coordinates, as well as the enormous computer resources required. In CosmoGrid we dealt with the computational requirements by connecting up to four supercomputers via an optical network and making them operate as a single machine. This was challenging, if only because the supercomputers of our choice are separated by half the planet: three of them are scattered across Europe and the fourth is in Tokyo. The co-scheduling of multiple computers and the 'gridification' of the code enabled us to achieve an efficiency of up to 93% for this distributed intercontinental supercomputer. In…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
