Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Tiancheng Hu, Jin Qin, Zheng Wang, Junhao Hu, Yuzheng Wang, Lei Chen, Yizhou Shan, Mingxing Zhang, Ting Cao, Chunwei Xia, Huimin Cui, Tao Xie, Chenxi Wang

TL;DR
Tessera is a kernel disaggregation system that improves performance and cost efficiency for large model inference on heterogeneous GPUs by aligning each kernel with the hardware capabilities it needs.
Contribution
Tessera is presented by the authors as the first kernel-level disaggregation approach. It adapts to the diverse resource demands of kernels within a single application, improving efficiency over existing coarse-grained methods that are tied to specific model architectures.
Findings
Up to 2.3x increase in serving throughput.
Up to 1.6x improvement in cost efficiency.
Heterogeneous GPU pairs can outperform homogeneous high-end GPU setups.
Abstract
Disaggregation maps parts of an AI workload to different types of GPUs, offering a path to utilize modern heterogeneous GPU clusters. However, existing solutions operate at a coarse granularity and are tightly coupled to specific model architectures, leaving much room for performance improvement. This paper presents Tessera, the first kernel disaggregation system to improve performance and cost efficiency on heterogeneous GPUs for large model inference. Our key insight is that kernels within a single application exhibit diverse resource demands, making them the most suitable granularity for aligning computation with hardware capabilities. Tessera integrates offline analysis with online adaptation by extracting precise inter-kernel dependencies from PTX to ensure correctness, overlapping communication with computation through a pipelined execution model, and employing workload-aware…
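The key insight above, that kernels in one application differ in resource demands, can be illustrated with a minimal sketch. It is not Tessera's algorithm: it assumes a simple roofline-style cost estimate, ignores inter-kernel dependencies and communication cost (both of which the abstract says Tessera handles), and every name and number in it (`Gpu`, `Kernel`, `estimated_time`, `place`, the device figures) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    peak_flops: float     # peak compute throughput, FLOP/s (hypothetical)
    mem_bandwidth: float  # peak memory bandwidth, bytes/s (hypothetical)


@dataclass(frozen=True)
class Kernel:
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to device memory


def estimated_time(kernel: Kernel, gpu: Gpu) -> float:
    """Roofline-style estimate: the slower of compute time and memory time."""
    compute_time = kernel.flops / gpu.peak_flops
    memory_time = kernel.bytes_moved / gpu.mem_bandwidth
    return max(compute_time, memory_time)


def place(kernels: list[Kernel], gpus: list[Gpu]) -> dict[str, str]:
    """Assign each kernel to the GPU with the lowest estimated time.

    Assumption: kernels are treated as independent and transfers are free.
    """
    return {
        k.name: min(gpus, key=lambda g: estimated_time(k, g)).name
        for k in kernels
    }


# Two hypothetical devices: one strong in compute, one strong in bandwidth.
gpus = [
    Gpu("compute_heavy", peak_flops=400e12, mem_bandwidth=1.0e12),
    Gpu("bandwidth_heavy", peak_flops=100e12, mem_bandwidth=2.0e12),
]

# Two hypothetical kernels: one compute-bound, one memory-bound.
kernels = [
    Kernel("gemm", flops=2e12, bytes_moved=1e9),
    Kernel("attention_decode", flops=1e9, bytes_moved=4e9),
]

placement = place(kernels, gpus)
print(placement)
```

Under these assumed numbers, the compute-bound kernel lands on the compute-heavy device and the memory-bound kernel on the bandwidth-heavy one. A real system would also have to weigh the cost of moving data between devices.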
