LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming

Siyuan Shen; Langwen Huang; Marcin Chrapek; Timo Schneider; Jai Dayal; Manisha Gajbe; Robert Wisniewski; Torsten Hoefler

arXiv:2404.14193·cs.DC·April 23, 2024

TL;DR

LLAMP is an analytical tool that efficiently assesses the network latency tolerance of HPC applications using linear programming, providing accurate predictions without specialized hardware.

Contribution

This paper introduces LLAMP, a novel, fast, and accurate method for evaluating HPC applications' latency tolerance using linear programming and the LogGPS model.

Findings

1. LLAMP achieves prediction errors below 2% for MPI applications.
2. The tool effectively evaluates latency tolerance across diverse applications.
3. Case study demonstrates broad applicability in real-world scenarios.

Abstract

The shift towards high-bandwidth networks driven by AI workloads in data centers and HPC clusters has unintentionally aggravated network latency, adversely affecting the performance of communication-intensive HPC applications. As large-scale MPI applications often exhibit significant differences in their network latency tolerance, it is crucial to accurately determine the extent of network latency an application can withstand without significant performance degradation. Current approaches to assessing this metric often rely on specialized hardware or network simulators, which can be inflexible and time-consuming. In response, we introduce LLAMP, a novel toolchain that offers an efficient, analytical approach to evaluating HPC applications' network latency tolerance using the LogGPS model and linear programming. LLAMP equips software developers and network architects with essential…
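To make the idea of "latency tolerance" concrete, here is a minimal toy sketch. It is not LLAMP's actual formulation, and every name and number in it is hypothetical. It assumes that each path through an application's execution graph costs a fixed amount of time plus a count of message latencies multiplied by the network latency L, in the spirit of LogGP-style models. The runtime is then the longest path, and the tolerance is the largest L that keeps the runtime within a chosen slowdown budget. That is a one-variable linear program (maximize L subject to one linear constraint per path), which this sketch solves in closed form rather than with an LP solver.

```python
from typing import List, Tuple

# Hypothetical path description: (fixed_cost_us, latency_count).
# A path's length is fixed_cost_us + latency_count * L for network latency L (us).
Path = Tuple[float, int]


def runtime(paths: List[Path], latency: float) -> float:
    """Runtime is the longest path: max over paths of c + k * L."""
    return max(c + k * latency for c, k in paths)


def latency_tolerance(paths: List[Path], base_latency: float, slowdown: float) -> float:
    """Largest L with runtime(L) <= (1 + slowdown) * runtime(base_latency).

    Equivalent to the LP: maximize L subject to c + k * L <= budget for every
    path. With a single variable, the optimum is the tightest upper bound.
    """
    budget = (1.0 + slowdown) * runtime(paths, base_latency)
    # Paths with k == 0 do not depend on L and impose no bound on it.
    bounds = [(budget - c) / k for c, k in paths if k > 0]
    return min(bounds) if bounds else float("inf")


# Made-up example: three paths with different compute/communication mixes.
PATHS: List[Path] = [(100.0, 2), (80.0, 10), (110.0, 0)]

if __name__ == "__main__":
    base = runtime(PATHS, 1.0)
    tol = latency_tolerance(PATHS, base_latency=1.0, slowdown=0.10)
    print(f"baseline runtime: {base:.1f} us, tolerated latency: {tol:.2f} us")
```

In this toy instance the compute-only path dominates at the baseline latency, so the runtime does not change until the latency grows enough for the communication-heavy path to become critical. A real execution graph has far too many paths to enumerate, which is presumably why a general LP formulation is attractive.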

Code & Models

Repositories

spcl/llamp (official)

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques