Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads
Aleix Roca, Vicenç Beltran

TL;DR
This paper introduces USF, a user-space process scheduling framework that reduces interference and improves performance in oversubscribed HPC and AI workloads by enabling custom scheduling policies without kernel modifications.
Contribution
The paper presents USF, a novel user-space scheduling framework with a cooperative policy that mitigates interference and enhances performance in complex multi-runtime, oversubscribed environments.
Findings
Up to 2.4x performance gains in oversubscribed scenarios
Effective reduction of scheduling interference issues
Seamless coordination across multiple runtimes
Abstract
The convergence of high-performance computing (HPC) and artificial intelligence (AI) is driving the emergence of increasingly complex parallel applications and workloads. These workloads often combine multiple parallel runtimes within the same application or across co-located jobs, creating scheduling demands that place significant stress on traditional OS schedulers. When oversubscribed (there are more ready threads than cores), OS schedulers rely on periodic preemptions to multiplex cores, often introducing interference that may degrade performance. In this paper, we present: (1) The User-space Scheduling Framework (USF), a novel seamless process scheduling framework completely implemented in user-space. USF enables users to implement their own process scheduling algorithms without requiring special permissions. We evaluate USF with its default cooperative policy, (2) SCHED_COOP,…
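To make the terms in the abstract concrete, here is a minimal toy sketch of cooperative scheduling under oversubscription. It is not USF or SCHED_COOP, whose internals are not described in this summary. It assumes tasks are Python generators that give up their core only at their own yield points, and the names `worker`, `is_oversubscribed`, and `run_cooperative` are hypothetical, chosen for illustration.

```python
from collections import deque


def worker(name, steps, log):
    """Hypothetical task: does one unit of work, then yields voluntarily."""
    for i in range(steps):
        log.append((name, i))  # one unit of work
        yield                  # hand the core back (no preemption involved)


def is_oversubscribed(num_ready, num_cores):
    """Oversubscription as defined in the abstract: more ready threads than cores."""
    return num_ready > num_cores


def run_cooperative(tasks, num_cores):
    """Run tasks round-robin, at most num_cores per round, never preempting.

    Returns the number of rounds needed to finish every task.
    """
    ready = deque(tasks)
    rounds = 0
    while ready:
        rounds += 1
        # Pick up to num_cores tasks; the rest wait in the ready queue.
        batch = [ready.popleft() for _ in range(min(num_cores, len(ready)))]
        for task in batch:
            try:
                next(task)          # runs until the task yields on its own
                ready.append(task)  # still has work: requeue at the tail
            except StopIteration:
                pass                # task finished, drop it
    return rounds


if __name__ == "__main__":
    log = []
    tasks = [worker(f"t{i}", 2, log) for i in range(4)]
    print(is_oversubscribed(len(tasks), 2))  # four ready tasks on two cores
    print(run_cooperative(tasks, 2))
```

In this toy model a task is never interrupted in the middle of a work unit, which is the property a cooperative policy relies on. A real scheduler must also handle blocking calls, synchronization, and tasks that never yield.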
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
