Kitsune: Enabling Dataflow Execution on GPUs

Michael Davies; Neal Crago; Karthikeyan Sankaralingam; Stephen W. Keckler

arXiv:2502.18403·cs.AR·February 26, 2025

TL;DR

Kitsune combines GPU architectural adjustments with a compiler to enable efficient dataflow execution on GPUs, improving performance and reducing off-chip traffic for deep learning workloads without a full hardware redesign.

Contribution

The paper presents Kitsune, a novel set of primitives and a compiler that facilitate dataflow execution on GPUs, addressing limitations of traditional bulk-synchronous models.

Findings

01. Achieves 1.3×-2.3× performance improvement on challenge applications.

02. Reduces off-chip traffic by up to 98% during inference.

03. Provides 1.1×-2.4× performance gains during training.

Abstract

State-of-the-art DL models are growing in size and complexity, with many modern models also increasing in heterogeneity of behavior. GPUs are still the dominant platform for DL applications, relying on a bulk-synchronous execution model which has many drawbacks and is ill-suited for the graph structure of DL applications. Many industry and academic works attempt to overcome these by employing vertical fusion, but this approach still fails to realize three untapped opportunities: (1) the fact that many resources on the GPU are idle while only one operator executes due to temporal multiplexing of the SM; (2) lower energy from more intelligent on-chip data movement, which leads to higher performance in a power-provisioned environment; and (3) the inability to exploit hidden or reduction dimensions as a source of parallelism to ease pressure on batch size. This paper explores relatively uncharted…
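The contrast between bulk-synchronous execution and keeping intermediates on chip can be illustrated with a toy traffic count. This is a minimal sketch under stated assumptions, not Kitsune's mechanism: the function names `run_bulk_synchronous` and `run_streamed`, the tile size, and the use of element counts as a stand-in for off-chip bytes are all hypothetical. It models only the traffic effect of not spilling intermediates between operators, which fusion and dataflow execution share; it does not model operators running concurrently on different SMs.

```python
from typing import Callable, List, Sequence, Tuple

Op = Callable[[float], float]


def run_bulk_synchronous(ops: Sequence[Op], data: List[float]) -> Tuple[List[float], int]:
    """One 'kernel' per operator: every intermediate makes a round trip off chip."""
    traffic = 0  # elements moved to or from off-chip memory (toy proxy for bytes)
    current = list(data)
    for op in ops:
        traffic += len(current)             # kernel reads its whole input
        current = [op(x) for x in current]
        traffic += len(current)             # kernel writes its whole output
    return current, traffic


def run_streamed(ops: Sequence[Op], data: List[float], tile: int = 4) -> Tuple[List[float], int]:
    """Tiles pass through every operator before anything is written back."""
    traffic = 0
    out: List[float] = []
    for start in range(0, len(data), tile):
        block = data[start:start + tile]
        traffic += len(block)               # single off-chip read of the tile
        for op in ops:
            block = [op(x) for x in block]  # assumed to stay in on-chip storage
        out.extend(block)
        traffic += len(block)               # single off-chip write of the result
    return out, traffic


if __name__ == "__main__":
    # Hypothetical three-operator elementwise chain.
    chain = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
    values = [float(i) for i in range(16)]
    bulk_out, bulk_traffic = run_bulk_synchronous(chain, values)
    stream_out, stream_traffic = run_streamed(chain, values)
    print("same result:", bulk_out == stream_out)
    print("bulk-synchronous traffic:", bulk_traffic)
    print("streamed traffic:", stream_traffic)
```

In this toy model the streamed traffic is independent of the number of operators in the chain, while the bulk-synchronous traffic grows linearly with it. Real workloads also move weights and have operators that are not elementwise, so the numbers here say nothing about the paper's measured results.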

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques