Call Scheduling to Reduce Response Time of a FaaS System

Paweł Żuk; Bartłomiej Przybylski; Krzysztof Rzadca

arXiv:2207.13168·cs.DC·November 1, 2022·1 cites

Open Access

TL;DR

This paper introduces a worker-level scheduling method for FaaS systems that significantly reduces response times under heavy loads without adding more nodes, by queuing requests based on history and limiting CPU usage per request.

Contribution

The paper proposes a novel scheduling approach inspired by HPC techniques, replacing OS preemption with request queuing and CPU limiting, improving FaaS response times under load.
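The idea of queuing calls on the worker instead of letting the OS time-slice them can be sketched in a few lines. This is only an illustration, not the authors' implementation: the class `HistoryQueue`, its methods, the "shortest expected run time first" ordering, and the mean-of-past-runs estimate are all assumptions chosen for the example. The per-request CPU limit is modeled simply as a fixed number of slots (for instance, one call per core).

```python
import heapq
from collections import defaultdict
from itertools import count


class HistoryQueue:
    """Hypothetical worker-level queue: shortest expected run time first."""

    def __init__(self, slots, default_estimate=1.0):
        self.slots = slots                        # max calls running at once (assumed: one per core)
        self.running = 0
        self.default_estimate = default_estimate  # estimate for functions with no history
        self.history = defaultdict(list)          # function name -> past run times in seconds
        self._heap = []                           # (expected run time, arrival order, function name)
        self._order = count()                     # tie-breaker that keeps arrival order

    def record(self, fn, run_time):
        """Store one observed run time for a function."""
        self.history[fn].append(run_time)

    def expected(self, fn):
        """Estimate a call's run time as the mean of its past runs."""
        past = self.history[fn]
        return sum(past) / len(past) if past else self.default_estimate

    def submit(self, fn):
        """Queue a call instead of starting it immediately."""
        heapq.heappush(self._heap, (self.expected(fn), next(self._order), fn))

    def dispatch(self):
        """Start queued calls while a slot is free; return the started names."""
        started = []
        while self._heap and self.running < self.slots:
            _, _, fn = heapq.heappop(self._heap)
            self.running += 1
            started.append(fn)
        return started

    def finish(self, fn, run_time):
        """Release the slot and remember how long the call took."""
        self.record(fn, run_time)
        self.running -= 1
```

With one slot and a history in which a hypothetical `ping` function is much shorter than `resize`, submitting `resize` and then `ping` makes `dispatch()` start `ping` first; `resize` starts only after `finish("ping", ...)` frees the slot. A running call is never preempted, which is the contrast with OS time-slicing that the contribution describes.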

Findings

01. Average response time decreased by a factor of 4.

02. Average request stretch decreased by a factor of 18.

03. Fewer machines needed for better response-time statistics.
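To make the two metrics concrete, the sketch below assumes the usual scheduling definitions, which this page does not spell out: response time is completion time minus arrival time, and stretch is response time divided by the call's processing time. The helper `averages` and the input tuples are made-up illustrations, not measurements from the paper.

```python
def averages(calls):
    """Return (average response time, average stretch).

    Each call is a tuple (arrival, completion, processing), all in seconds.
    Assumed definitions: response = completion - arrival,
    stretch = response / processing.
    """
    n = len(calls)
    avg_response = sum(c - a for a, c, _ in calls) / n
    avg_stretch = sum((c - a) / p for a, c, p in calls) / n
    return avg_response, avg_stretch


# Toy data: a 4 s call and a 1 s call arrive together on a single core.
long_first = [(0.0, 4.0, 4.0), (0.0, 5.0, 1.0)]   # long call runs first
short_first = [(0.0, 1.0, 1.0), (0.0, 5.0, 4.0)]  # short call runs first

print(averages(long_first))   # (4.5, 3.0)
print(averages(short_first))  # (3.0, 1.125)
```

In this toy case, reordering improves average stretch more than average response time, because stretch penalizes short calls that wait behind long ones. That is consistent with stretch being the metric with the larger reported improvement.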

Abstract

In an overloaded FaaS cluster, individual worker nodes strain under lengthening queues of requests. Although the cluster might eventually be scaled horizontally, adding a new node takes dozens of seconds. As serving applications are tuned for tail serving latencies, and these increase greatly under heavier loads, the current workaround is resource over-provisioning. In fact, even though a service can withstand a steady load of, e.g., 70% CPU utilization, the autoscaler is triggered at, e.g., 30-40% (thus the service uses twice as many nodes as would be needed). We propose an alternative: a worker-level method that handles heavy load without increasing the number of nodes. FaaS executions are not interactive, unlike, e.g., text editors: end users do not benefit when processes are allocated the CPU often but only for short periods. Inspired by scheduling methods for High Performance…

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Distributed systems and fault tolerance