GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion
Yiwei Yang, Xiangyu Gao, Yuan Zhou, Yuhang Gan, Yusheng Zheng, Andi Quinn

TL;DR
GPUOS is a GPU runtime system that cuts kernel launch overhead by keeping a single persistent kernel resident and injecting operators into it at runtime, accelerating small tensor operations in deep learning workloads by up to 15.3x over standard PyTorch.
Contribution
GPUOS introduces a novel persistent kernel architecture with runtime operator injection, enabling efficient execution of diverse small tensor operations without kernel restarts.
Findings
Achieves up to 15.3x speedup over standard PyTorch on workloads dominated by small operations.
Supports arbitrary tensor shapes, data types, and broadcasting.
Improves GPU utilization in micro-batched inference and attention workloads.
Abstract
Modern deep learning workloads often consist of many small tensor operations, especially in inference, attention, and micro-batched training. In these settings, kernel launch overhead can become a major bottleneck, sometimes exceeding the actual computation time. We present GPUOS, a GPU runtime JIT system that reduces launch overhead using a persistent kernel architecture with runtime operator injection. GPUOS runs a single long-lived GPU kernel that continuously processes tasks from a host-managed work queue, eliminating repeated kernel launches. To support diverse operations, GPUOS uses NVIDIA NVRTC to just-in-time compile operators at runtime and inject them into the running kernel through device function pointer tables. This design enables operator updates without restarting the kernel or recompiling the system. GPUOS introduces four key ideas: (1) a persistent worker kernel…
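The control flow described in the abstract (one long-lived worker draining a host-managed queue and dispatching through an operator table that can be updated while it runs) can be illustrated with a small host-side analogy. This is only a sketch in plain Python threads, not GPUOS code: the class `PersistentWorker`, its methods `inject`, `submit`, and `shutdown`, and the operator names `"add"` and `"scale"` are all hypothetical, the dictionary merely stands in for a device function pointer table, and NVRTC compilation is not modeled at all.

```python
import queue
import threading
from concurrent.futures import Future


class PersistentWorker:
    """CPU analogy of a persistent kernel (hypothetical, not the GPUOS API)."""

    def __init__(self):
        self.tasks = queue.Queue()     # stands in for the host-managed work queue
        self.op_table = {}             # stands in for the device function pointer table
        self.lock = threading.Lock()
        # The loop is started once and reused for every task ("no relaunch").
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def inject(self, op_id, fn):
        """Register or replace an operator while the worker loop keeps running."""
        with self.lock:
            self.op_table[op_id] = fn

    def submit(self, op_id, *args):
        """Enqueue a task and return a Future for its result."""
        future = Future()
        self.tasks.put((op_id, args, future))
        return future

    def _run(self):
        while True:
            item = self.tasks.get()
            if item is None:           # sentinel: stop the loop
                break
            op_id, args, future = item
            with self.lock:
                fn = self.op_table.get(op_id)
            try:
                if fn is None:
                    raise KeyError(f"operator {op_id!r} has not been injected")
                future.set_result(fn(*args))
            except Exception as exc:   # report failures to the submitter
                future.set_exception(exc)

    def shutdown(self):
        self.tasks.put(None)
        self.thread.join()


worker = PersistentWorker()
worker.inject("add", lambda a, b: [x + y for x, y in zip(a, b)])
first = worker.submit("add", [1, 2], [3, 4]).result(timeout=5)

# Inject a second operator without restarting the worker loop.
worker.inject("scale", lambda a, s: [x * s for x in a])
second = worker.submit("scale", first, 2).result(timeout=5)

worker.shutdown()
```

The point of the analogy is that both `submit` calls are served by the same running loop, and `"scale"` becomes available after the loop has already started. On a GPU, the equivalent dispatch would go through device-side function pointers rather than a Python dictionary, and the synchronization details would differ substantially.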
