Fast Cross-Operator Optimization of Attention Dataflow
Haodong Chang, Hailiang Hu, Zhenrui Wang, Yu Gong, Rongjian Liang, Zhexiang Tang, Bo Yuan, Jiang Hu

TL;DR
This paper introduces MMEE, an analytical optimization method for attention dataflow in transformers. It reduces energy consumption by 48%-50% and latency by 31%-69% while running 64x to 343x faster than prior approaches.
Contribution
The paper presents MMEE, a novel matrix-based enumeration approach for cross-operator dataflow optimization in attention computation, enabling faster and higher-quality solutions.
Findings
Reduces energy consumption by 48%-50%.
Achieves latency reduction of 31%-69%.
Runs 64x to 343x faster than previous methods.
Abstract
Attention is a fundamental computational kernel that accounts for the majority of the workload in transformer and LLM computing. Optimizing dataflow is crucial for enhancing both performance and energy efficiency in attention computation. This optimization involves a range of decisions, such as tiling, computation ordering and buffer management, and can be applied at both intra-operator and inter-operator levels, resulting in a highly complex decision space. We propose a new approach to cross-operator dataflow optimization. Its centerpiece is an analytical performance model that spans a large decision space and enables matrix-based encoding of multiple candidate solutions. Built on this foundation, a vast number of solutions can be evaluated rapidly, and with the aid of an effective pruning technique, the optimal solution can be identified through exhaustive enumeration. We refer to our…
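The enumerate-and-prune idea described in the abstract can be illustrated with a minimal sketch. This is not MMEE's performance model or encoding. The traffic and buffer formulas below are simplified assumptions for a fused attention kernel with sequence length n and head dimension d, and every name (divisors, buffer_words, dram_traffic, best_dataflow) is hypothetical. Each candidate is a triple of loop ordering, query tile size, and key/value tile size. Candidates whose tiles exceed the on-chip buffer are pruned, and the remainder are scored by a closed-form cost.

```python
from itertools import product


def divisors(n):
    """Tile sizes that divide the sequence length evenly."""
    return [t for t in range(1, n + 1) if n % t == 0]


def buffer_words(tq, tk, d):
    """On-chip words needed (toy assumption): Q tile, K and V tiles,
    score tile, and output accumulator tile."""
    return tq * d + 2 * tk * d + tq * tk + tq * d


def dram_traffic(order, tq, tk, n, d):
    """Off-chip words moved under a toy analytical model (assumed, not from the paper).

    q_outer:  each Q tile stays resident while all K/V tiles stream past it,
              so K and V are re-read once per Q tile; O is written once.
    kv_outer: each K/V tile stays resident while all Q tiles stream past it,
              so Q is re-read and partial O is read and written once per K/V tile.
    """
    if order == "q_outer":
        return n * d + (n // tq) * 2 * n * d + n * d
    return 2 * n * d + (n // tk) * n * d + (n // tk) * 2 * n * d


def best_dataflow(n, d, capacity):
    """Exhaustively enumerate (order, tq, tk), prune infeasible candidates,
    and return (order, tq, tk, traffic) with minimum traffic.
    Ties are broken by the smaller buffer footprint.
    Returns None when no candidate fits in the buffer."""
    best_key, best = None, None
    tiles = divisors(n)
    for order, tq, tk in product(("q_outer", "kv_outer"), tiles, tiles):
        footprint = buffer_words(tq, tk, d)
        if footprint > capacity:
            continue  # pruning: tiles do not fit in the on-chip buffer
        cost = dram_traffic(order, tq, tk, n, d)
        key = (cost, footprint)
        if best_key is None or key < best_key:
            best_key, best = key, (order, tq, tk, cost)
    return best
```

Calling best_dataflow(64, 8, 1024) searches 98 candidates for a 64-token sequence with head dimension 8 and a 1024-word buffer. Because the cost is a closed-form expression, the same evaluation could be applied to all candidates at once as array operations, which is the general spirit of a matrix-based encoding; the paper's actual formulation may differ.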
