Cloudy Forecast: How Predictable is Communication Latency in the Cloud?
Owen Hilyard, Bocheng Cui, Marielle Webster, Abishek Bangalore Muralikrishna, Aleksey Charapko

TL;DR
This paper introduces Cloud Latency Tester (CLT), a tool for measuring the variability of communication delays in cloud environments, and argues that this variability must be accounted for when designing reliable distributed systems.
Contribution
The paper presents CLT, a practical tool for measuring cloud communication delays, and provides an empirical analysis across major cloud providers to inform system design.
Findings
Communication delays vary significantly across cloud providers.
Cloud latency variability impacts system timing assumptions.
Deploying CLT in real cloud environments yields practical lessons for system designers.
Abstract
Many systems and services rely on timing assumptions for performance and availability in critical aspects of their operation, such as timeouts for failure detectors or optimizations to concurrency control mechanisms. Many such assumptions depend on the ability of different components to communicate on time -- a delay in communication may trigger the failure detector or cause the system to enter a less-optimized execution mode. Unfortunately, these timing assumptions are often set with little regard to the actual communication guarantees of the underlying infrastructure -- in particular, the variability of communication delays between processes on different nodes/servers. Higher communication variability is especially likely for systems deployed in the public cloud, since the cloud is a utility shared by many users and organizations, making it prone to higher performance…
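To make the abstract's point about timeouts concrete, here is a minimal sketch of deriving a failure-detector timeout from measured round-trip times instead of picking a fixed constant. It is not CLT and not the authors' method: the helper names (`percentile`, `suggest_timeout`, `false_suspicion_rate`), the safety factor, and the sample values are all illustrative assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeout(rtt_ms, p=99, safety_factor=2.0):
    """Hypothetical rule: a high percentile of observed RTTs times a margin."""
    return percentile(rtt_ms, p) * safety_factor


def false_suspicion_rate(rtt_ms, timeout_ms):
    """Fraction of observed round trips that would have exceeded the timeout."""
    return sum(1 for r in rtt_ms if r > timeout_ms) / len(rtt_ms)


# Made-up round-trip samples in milliseconds: mostly fast, with a slow tail.
rtt_ms = [0.4] * 95 + [0.9] * 4 + [12.0]

timeout = suggest_timeout(rtt_ms, p=99, safety_factor=2.0)
print("median RTT (ms):", percentile(rtt_ms, 50))
print("p99 RTT (ms):", percentile(rtt_ms, 99))
print("suggested timeout (ms):", timeout)
print("false suspicion rate:", false_suspicion_rate(rtt_ms, timeout))
```

The sketch shows why the tail matters more than the typical case: a timeout tuned to the median would misfire on every slow sample, while one tuned to a high percentile only misfires on the rare outlier.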
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Distributed systems and fault tolerance
