
TL;DR
Blockbuster is a framework for AI operator fusion on multiprocessor architectures with a tiered memory hierarchy. It combines a graph-based workload representation with a rule-based fusion algorithm that explicitly models data movement between memory tiers.
Contribution
The paper presents a new rule-based fusion algorithm that explicitly models data movement between memory tiers, which lets it fuse complex AI operations such as attention.
Findings
Rediscovered the Flash Attention kernel
Fused LayerNorm with matrix multiplication
Fused multiple operations into a single mega-kernel
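To make the first finding concrete, here is a minimal sketch of the computation a Flash Attention style kernel performs: attention evaluated over blocks of keys and values, keeping only a running maximum, a running softmax denominator, and a running output, so the full score matrix is never stored. This is not the Blockbuster algorithm or its output; it is a plain-Python illustration of the well-known online-softmax technique, and the function names, the block size, and the sample data are assumptions made for this example. The usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math

def naive_attention(q, k, v):
    """Reference: build the full score row per query, softmax it, weight V."""
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e * vj[d] for e, vj in zip(exps, v)) / z
                    for d in range(len(v[0]))])
    return out

def blockwise_attention(q, k, v, block=2):
    """Same result, but K/V are consumed block by block (online softmax)."""
    dim = len(v[0])
    out = []
    for qi in q:
        m = float("-inf")   # running max of the scores seen so far
        z = 0.0             # running softmax denominator
        acc = [0.0] * dim   # running unnormalised output row
        for start in range(0, len(k), block):
            kb = k[start:start + block]
            vb = v[start:start + block]
            scores = [sum(a * b for a, b in zip(qi, kj)) for kj in kb]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # rescales earlier partial results
            exps = [math.exp(s - m_new) for s in scores]
            z = z * scale + sum(exps)
            acc = [a * scale + sum(e * vj[d] for e, vj in zip(exps, vb))
                   for d, a in enumerate(acc)]
            m = m_new
        out.append([a / z for a in acc])
    return out

# Hypothetical sample data: 2 queries, 5 keys/values, head dimension 2.
Q = [[1.0, 0.0], [0.5, -1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

reference = naive_attention(Q, K, V)
fused = blockwise_attention(Q, K, V, block=2)
max_err = max(abs(x - y)
              for r_ref, r_blk in zip(reference, fused)
              for x, y in zip(r_ref, r_blk))
print("max abs difference:", max_err)
```

The two functions agree up to floating-point rounding for any block size. The blockwise version only ever holds one block of scores at a time, which is the property that lets such a kernel keep its working set in a fast memory tier.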
Abstract
Blockbuster is a framework for AI operator fusion in inference programs. The Blockbuster framework is compatible with any multiprocessor architecture that has a tiered memory hierarchy, including GPUs, multi-core CPUs, and some AI accelerator chips. It includes a graph-based representation for AI workloads, called a block program, which explicitly models how blocks of data move between the memory tiers. It also includes an operator fusion procedure, which is made up of a candidate selection algorithm and a fusion algorithm that fuses each individual candidate. This two-algorithm structure makes Blockbuster especially suitable for large AI programs. The current paper focuses on the fusion algorithm, which is a rule-based technique. While the literature is full of previous rule-based fusion algorithms, what sets our algorithm apart is its direct modeling of data movement between memory…
Taxonomy
Methods: Softmax · Attention Is All You Need · Root Mean Square Layer Normalization
