Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

Tiancheng Hu; Jin Qin; Zheng Wang; Junhao Hu; Yuzheng Wang; Lei Chen; Yizhou Shan; Mingxing Zhang; Ting Cao; Chunwei Xia; Huimin Cui; Tao Xie; Chenxi Wang

arXiv:2604.10180 · cs.DC · April 14, 2026


TL;DR

Tessera is a novel kernel disaggregation system that enhances performance and cost efficiency for large model inference on heterogeneous GPUs by aligning kernels with hardware capabilities.

Contribution

Tessera introduces the first kernel-level disaggregation approach, one that adapts to the diverse resource demands of kernels within a single application and improves efficiency over existing coarse-grained methods.

Findings

01 Up to 2.3x increase in serving throughput.

02 Up to 1.6x improvement in cost efficiency.

03 Heterogeneous GPU pairs can outperform homogeneous high-end GPU setups.

Abstract

Disaggregation maps parts of an AI workload to different types of GPUs, offering a path to utilize modern heterogeneous GPU clusters. However, existing solutions operate at a coarse granularity and are tightly coupled to specific model architectures, leaving much room for performance improvement. This paper presents Tessera, the first kernel disaggregation system to improve performance and cost efficiency on heterogeneous GPUs for large model inference. Our key insight is that kernels within a single application exhibit diverse resource demands, making them the most suitable granularity for aligning computation with hardware capabilities. Tessera integrates offline analysis with online adaptation by extracting precise inter-kernel dependencies from PTX to ensure correctness, overlapping communication with computation through a pipelined execution model, and employing workload-aware…
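
To make the key insight concrete, the following is a minimal sketch of why per-kernel placement can help when kernels differ in resource demand. It is not Tessera's algorithm: it assumes a simple roofline cost model, and every name in it (Gpu, Kernel, roofline_time, place) as well as all device and kernel numbers are hypothetical values chosen for illustration. It also ignores inter-kernel dependencies and communication cost, both of which the abstract says Tessera handles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical device description (illustrative numbers only)."""
    name: str
    peak_tflops: float  # peak compute throughput, TFLOP/s
    mem_bw_tbps: float  # peak memory bandwidth, TB/s


@dataclass(frozen=True)
class Kernel:
    """Hypothetical kernel profile (illustrative numbers only)."""
    name: str
    tflop: float   # total arithmetic work, TFLOP
    tbytes: float  # total memory traffic, TB


def roofline_time(kernel: Kernel, gpu: Gpu) -> float:
    """Estimate runtime in seconds as the slower of compute and memory time."""
    compute_time = kernel.tflop / gpu.peak_tflops
    memory_time = kernel.tbytes / gpu.mem_bw_tbps
    return max(compute_time, memory_time)


def place(kernels: list[Kernel], gpus: list[Gpu]) -> dict[str, str]:
    """Assign each kernel to the GPU with the lowest estimated runtime.

    Assumption: kernels are treated as independent and transfers are free.
    """
    return {
        k.name: min(gpus, key=lambda g: roofline_time(k, g)).name
        for k in kernels
    }


gpus = [
    Gpu("compute_heavy", peak_tflops=400.0, mem_bw_tbps=1.0),
    Gpu("bandwidth_heavy", peak_tflops=100.0, mem_bw_tbps=3.0),
]
kernels = [
    Kernel("gemm", tflop=4.0, tbytes=0.002),              # compute-bound
    Kernel("attention_decode", tflop=0.01, tbytes=0.03),  # memory-bound
]

placement = place(kernels, gpus)
print(placement)
```

Under these made-up numbers the compute-bound kernel lands on the device with higher peak throughput and the memory-bound kernel on the device with higher bandwidth, which is the kind of alignment between kernels and hardware that the abstract describes.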
