Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis
Zhongchun Zhou, Yuhang Gu, Chengtao Lai, Ya Wang, Wei Zhang

TL;DR
This paper introduces Sim-FA, a GPGPU simulation framework for analyzing the FlashAttention pipeline, supporting new GPU features and providing accurate workload modeling for LLMs.
Contribution
It presents a cycle-accurate simulator integrated with FlashAttention-3, supporting recent GPU features and offering improved accuracy over existing models.
Findings
The simulator achieves a 5.7% mean absolute percentage error (MAPE) against an H800 GPU.
Provides a theoretical analysis explaining inaccuracies in existing models.
Supports new NVIDIA GPU features such as the Tensor Memory Accelerator (TMA).
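For readers unfamiliar with the accuracy metric in the findings, MAPE averages the relative error of each simulated value against its measured counterpart. The sketch below uses the standard definition; the function name `mape` and the cycle counts are hypothetical illustrations, not values or code from the paper.

```python
def mape(measured, simulated):
    """Mean absolute percentage error of simulated vs. measured values, in percent."""
    if len(measured) != len(simulated) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the per-sample relative error, then express it as a percentage.
    total = sum(abs(s - m) / abs(m) for m, s in zip(measured, simulated))
    return 100.0 * total / len(measured)


# Hypothetical kernel cycle counts (assumed for illustration only).
measured_cycles = [1000.0, 2000.0, 4000.0]   # e.g. hardware measurements
simulated_cycles = [1050.0, 1900.0, 4200.0]  # e.g. simulator predictions

error = mape(measured_cycles, simulated_cycles)
print(f"MAPE: {error:.2f}%")
```

Because each error is normalized by the measured value, short and long kernels contribute equally to the average, which is presumably why the metric suits comparisons across workloads of different sizes.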
Abstract
To efficiently support Large Language Models (LLMs), modern GPGPU architectures have introduced new features and programming paradigms, such as warp specialization. These features enable temporal overlap between the producer and consumer, as well as between matrix multiplication and activation function operations, substantially improving performance. To conduct effective AI infrastructure and computer architecture research, cycle-accurate simulators that support these new features, together with analytical models that faithfully capture workload characteristics, are essential. However, existing academic tools provide limited support for these emerging requirements. Existing cycle-accurate simulators do not incorporate new NVIDIA GPU features, such as the Tensor Memory Accelerator (TMA), in a timely manner. Moreover, existing analytical models can misestimate DRAM traffic under certain…
