GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion

Yiwei Yang; Xiangyu Gao; Yuan Zhou; Yuhang Gan; Yusheng Zheng; Andi Quinn

arXiv:2604.17861·cs.DC·April 21, 2026

TL;DR

GPUOS is a GPU runtime system that cuts kernel launch overhead by keeping a single persistent kernel running and injecting operators into it at runtime, reaching up to 15.3x speedup over standard PyTorch on small-operation workloads.

Contribution

GPUOS introduces a novel persistent kernel architecture with runtime operator injection, enabling efficient execution of diverse small tensor operations without kernel restarts.

Findings

1. Achieves up to 15.3x speedup over standard PyTorch on small operation workloads.

2. Supports arbitrary tensor shapes, data types, and broadcasting.

3. Improves GPU utilization in micro-batched inference and attention workloads.

Abstract

Modern deep learning workloads often consist of many small tensor operations, especially in inference, attention, and micro-batched training. In these settings, kernel launch overhead can become a major bottleneck, sometimes exceeding the actual computation time. We present GPUOS, a GPU runtime JIT system that reduces launch overhead using a persistent kernel architecture with runtime operator injection. GPUOS runs a single long-lived GPU kernel that continuously processes tasks from a host-managed work queue, eliminating repeated kernel launches. To support diverse operations, GPUOS uses NVIDIA NVRTC to just-in-time compile operators at runtime and inject them into the running kernel through device function pointer tables. This design enables operator updates without restarting the kernel or recompiling the system. GPUOS introduces four key ideas: (1) a persistent worker kernel…
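The control flow described in the abstract (one long-lived worker, a host-managed work queue, and a table of operators that can be updated while the worker runs) can be illustrated with a CPU-only analogy. The sketch below is not GPUOS code and does not use CUDA or NVRTC: a Python thread stands in for the persistent kernel, `queue.Queue` for the work queue, and a plain list of callables for the device function pointer table. The names `PersistentWorker`, `inject`, and `submit` are hypothetical and chosen only for this illustration.

```python
import queue
import threading


class PersistentWorker:
    """CPU analogy of a persistent kernel: one long-lived worker, one task queue."""

    def __init__(self, table_size=8):
        self.tasks = queue.Queue()            # stands in for the host-managed work queue
        self.op_table = [None] * table_size   # stands in for the function pointer table
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()                   # started once, never restarted

    def inject(self, slot, fn):
        """Install or replace an operator while the worker keeps running."""
        self.op_table[slot] = fn

    def submit(self, slot, *args):
        """Enqueue a task; returns a one-item queue that will hold the result."""
        if self.op_table[slot] is None:
            raise KeyError(f"no operator installed in slot {slot}")
        result = queue.Queue(maxsize=1)
        self.tasks.put((slot, args, result))
        return result

    def shutdown(self):
        self.tasks.put(None)                  # sentinel: ask the worker loop to exit
        self.thread.join()

    def _run(self):
        # The worker loop: pull a task, look up its operator by slot, run it.
        while True:
            task = self.tasks.get()
            if task is None:
                break
            slot, args, result = task
            result.put(self.op_table[slot](*args))


worker = PersistentWorker()

# Operator 0 is available from the start.
worker.inject(0, lambda a, b: [x + y for x, y in zip(a, b)])
added = worker.submit(0, [1, 2], [3, 4]).get()

# Operator 1 is injected later; the same worker thread picks it up.
worker.inject(1, lambda a, b: [x * y for x, y in zip(a, b)])
multiplied = worker.submit(1, [1, 2], [3, 4]).get()

worker.shutdown()
print(added, multiplied)
```

The point of the analogy is that the dispatch loop is started once and every later task costs only an enqueue and a table lookup. In the system the abstract describes, the table entries would be device functions compiled at runtime rather than Python callables, and the details of that mechanism are not covered by this sketch.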
