Mitigating context switching in densely packed Linux clusters with Latency-Aware Group Scheduling
Al Amjad Tawfiq Isstaif, Evangelia Kalyvianaki, Richard Mortier

TL;DR
This paper identifies how CPU context switching in densely packed Linux clusters hampers performance and proposes kernel scheduler modifications that reduce overhead, enabling smaller, more efficient clusters for serverless workloads.
Contribution
It introduces kernel scheduler modifications that mitigate context switching overhead, improving cluster efficiency and reducing resource requirements in densely packed Linux environments.
Findings
Achieved 28% smaller cluster size with proposed scheduler changes.
Reduced context switching overhead significantly in dense workloads.
Improved task completion times by prioritizing task draining.
Abstract
Cluster orchestrators such as Kubernetes depend on accurate estimates of node capacity and job requirements. Inaccuracies in either lead to poor placement decisions and degraded cluster performance. In this paper, we show that in densely packed workloads, such as serverless applications, CPU context switching overheads can become so significant that a node's performance is severely degraded, even when the orchestrator placement is theoretically sound. In practice this issue is typically mitigated by over-provisioning the cluster, leading to wasted resources. We show that these context switching overheads arise from both an increase in the average cost of an individual context switch and a higher rate of context switching, which together amplify overhead multiplicatively when managing large numbers of concurrent cgroups, Linux's group scheduling mechanism for managing multi-threaded…
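The multiplicative effect described in the abstract can be illustrated with a small back-of-the-envelope model. This is only a sketch: the function name and every number below are hypothetical values chosen for illustration, not measurements from the paper. It treats overhead as the switch rate times the average cost per switch, so growth in both factors compounds.

```python
def switching_overhead(rate_per_sec: float, cost_us: float) -> float:
    """Fraction of one CPU-second spent on context switches.

    Simplified model: overhead = (switches per second) * (average cost per switch).
    """
    return rate_per_sec * cost_us * 1e-6


# Hypothetical baseline: a lightly loaded node (illustrative numbers only).
base_rate, base_cost_us = 5_000, 2.0
baseline = switching_overhead(base_rate, base_cost_us)

# Hypothetical dense packing: assume the rate grows 4x and the per-switch cost 3x.
rate_factor, cost_factor = 4, 3
dense = switching_overhead(base_rate * rate_factor, base_cost_us * cost_factor)

# The two factors multiply rather than add.
amplification = dense / baseline
print(f"baseline={baseline:.2%} dense={dense:.2%} amplification={amplification:.1f}x")
```

Under these assumed factors the overhead grows by the product of the two (4 × 3), not their sum, which is why a node can degrade sharply even when placement looks sound on paper.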
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Software System Performance and Reliability
