The ExaNeSt Prototype: Evaluation of Efficient HPC Communication Hardware in an ARM-based Multi-FPGA Rack
Manolis Ploumidis, Fabien Chaix, Nikolaos Chrysos, Marios Assiminakis, Vassilis Flouris, Nikolaos Kallimanis, Nikolaos Kossifidis, Michael Nikoloudakis, Polydoros Petrakis, Nikolaos Dimou, Michael Gianioudis, George Ieronymakis, Aggelos Ioannou, George Kalokerinos

TL;DR
This paper presents the ExaNeSt Prototype, a liquid-cooled FPGA-based HPC rack with custom interconnects and software, demonstrating low-latency communication and scalable performance for exascale computing applications.
Contribution
The paper introduces a novel FPGA-based HPC prototype with a custom interconnect and runtime software, enabling efficient communication and MPI support for exascale system research.
Findings
Single-hop latency of 1.3 microseconds
Bandwidth utilization reaches 82% of theoretical capacity
Custom Allreduce reduces collective latency by up to 88%
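To put the percentages above in more familiar terms, the following minimal sketch converts them into a speedup factor and an absolute bandwidth. It assumes that the "theoretical capacity" refers to the 10-Gbps link rate mentioned in the abstract, which this summary does not state explicitly; the function and constant names are illustrative only.

```python
def speedup_from_reduction(reduction_fraction: float) -> float:
    """Speedup factor implied by a fractional latency reduction.

    A reduction of r leaves (1 - r) of the original latency,
    so the speedup is 1 / (1 - r).
    """
    if not 0.0 <= reduction_fraction < 1.0:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return 1.0 / (1.0 - reduction_fraction)


def achieved_bandwidth_gbps(utilization: float, link_rate_gbps: float) -> float:
    """Absolute bandwidth for a given utilization of a link rate."""
    return utilization * link_rate_gbps


# Figures taken from the Findings list.
ALLREDUCE_LATENCY_REDUCTION = 0.88  # "up to 88%"
BANDWIDTH_UTILIZATION = 0.82        # "82% of theoretical capacity"

# Assumption: theoretical capacity is the 10-Gbps link rate from the abstract.
ASSUMED_LINK_RATE_GBPS = 10.0

allreduce_speedup = speedup_from_reduction(ALLREDUCE_LATENCY_REDUCTION)
bandwidth_gbps = achieved_bandwidth_gbps(BANDWIDTH_UTILIZATION, ASSUMED_LINK_RATE_GBPS)

print(f"Allreduce speedup (best case): {allreduce_speedup:.1f}x")
print(f"Achieved bandwidth (assumed 10-Gbps link): {bandwidth_gbps:.1f} Gbps")
```

Under these assumptions, an 88% latency reduction corresponds to a best-case speedup of roughly 8x, and 82% utilization of a 10-Gbps link corresponds to about 8.2 Gbps.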
Abstract
We present and evaluate the ExaNeSt Prototype, a liquid-cooled rack prototype consisting of 256 Xilinx ZU9EG MPSoCs, 4 TBytes of DRAM, 16 TBytes of SSD, and configurable interconnection 10-Gbps hardware. We developed this testbed in 2016-2019 to validate the flexibility of FPGAs for experimenting with efficient hardware support for HPC communication among tens of thousands of processors and accelerators in the quest towards Exascale systems and beyond. We present our key design choices regarding overall system architecture, PCBs and runtime software, and summarize insights resulting from measurement and analysis. Of particular note, our custom interconnect includes a low-cost low-latency network interface, offering user-level zero-copy RDMA, which we have tightly coupled with the ARMv8 processors in the MPSoCs. We have developed a system software runtime on top of these features, and…
Taxonomy
Topics: Interconnection Networks and Systems · Cloud Computing and Resource Management · Software-Defined Networks and 5G
