Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

Sangeeta Chowdhary; Ryan Swann; Sean Siddens; Muhammad Osama; Stephen Neuendorffer; Alexandru Dutu; Karthik Sangaiah; Sandeepa Bhuyan; Samuel Bayliss; Ganesh Dasika

arXiv:2604.15379 · cs.AR · April 20, 2026


TL;DR

Fleet introduces a hierarchical task model for multi-die GPUs that improves cache utilization and reduces memory traffic in memory-bound workloads such as LLM inference, with reported decode latency 1.3-1.5x lower than vLLM.

Contribution

The paper proposes Chiplet-tasks and a persistent kernel runtime with per-chiplet scheduling to map computation onto GPU chiplet hierarchies, a level of locality and synchronization that the flat execution hierarchy of CUDA/HIP cannot express.

Findings

01 Fleet achieves 1.3-1.5x lower decode latency than vLLM.

02 L2 hit rate increases from 12% to 54% at batch size 32.

03 Fleet reduces HBM traffic by up to 37% and delivers a 1.27-1.30x speedup.

Abstract

Modern GPUs adopt chiplet-based designs with multiple private cache hierarchies, but current programming models (CUDA/HIP) expose a flat execution hierarchy that cannot express chiplet-level locality or synchronization. This mismatch leads to redundant memory traffic and poor cache utilization in memory-bound workloads such as LLM inference. We present Fleet, a multi-level task model that maps computation to memory scopes. Fleet introduces Chiplet-tasks, a new abstraction that binds work and data to a chiplet and enables coordination through its shared L2 cache. Wavefront-level, CU-level, and device-level tasks align with existing abstractions, while Chiplet-tasks expose a previously unaddressed level of the hierarchy. Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache…
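The idea of binding work to a chiplet and scheduling it per chiplet can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not Fleet's API or its GPU runtime: the names `Scope`, `Task`, and `ChipletScheduler`, the shortest-queue placement of non-chiplet tasks, and the round-robin worker assignment are all hypothetical and chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Scope(Enum):
    """Task levels named in the abstract (the enum itself is hypothetical)."""
    WAVEFRONT = 1
    CU = 2
    CHIPLET = 3
    DEVICE = 4


@dataclass
class Task:
    name: str
    scope: Scope
    chiplet: Optional[int] = None  # required only for CHIPLET-scoped tasks


@dataclass
class ChipletScheduler:
    """Toy model: one FIFO queue per chiplet, drained by that chiplet's workers."""
    num_chiplets: int
    queues: list = field(init=False)

    def __post_init__(self):
        self.queues = [deque() for _ in range(self.num_chiplets)]

    def submit(self, task: Task) -> None:
        if task.scope is Scope.CHIPLET:
            if task.chiplet is None:
                raise ValueError("a chiplet-scoped task must name its chiplet")
            # Pin the task to its chiplet so its data can stay in that L2.
            self.queues[task.chiplet].append(task)
        else:
            # Assumption: other scopes go to the currently shortest queue.
            target = min(range(self.num_chiplets),
                         key=lambda c: len(self.queues[c]))
            self.queues[target].append(task)

    def run(self, workers_per_chiplet: int) -> list:
        """Drain every queue; return (chiplet, worker, task name) records."""
        trace = []
        for chiplet, queue in enumerate(self.queues):
            worker = 0
            while queue:
                task = queue.popleft()
                trace.append((chiplet, worker, task.name))
                worker = (worker + 1) % workers_per_chiplet
        return trace


# Two layers of work, each sharded across two chiplets.
sched = ChipletScheduler(num_chiplets=2)
for layer in range(2):
    for c in range(2):
        sched.submit(Task(f"layer{layer}_shard{c}", Scope.CHIPLET, chiplet=c))

trace = sched.run(workers_per_chiplet=2)
for record in trace:
    print(record)
```

In this toy model, every task for shard `c` runs on chiplet `c`, which is the property that would let successive tasks reuse data resident in that chiplet's L2 rather than refetching it from HBM. A real persistent-kernel runtime would run the workers concurrently on the GPU and synchronize through memory; the sequential loop here only shows the placement policy.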
