FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow
Sina Heidari, Dimitrios S. Nikolopoulos

TL;DR
FACT introduces a three-stage agentic workflow that synthesizes optimized GPU kernels from PyTorch modules, combining pattern discovery, realization, and composition to outperform existing libraries and baselines.
Contribution
The paper presents a novel agent-driven framework that automates kernel synthesis and optimization, integrating pattern discovery, auto-tuning, and composition grounded in CUTLASS C++.
Findings
Achieves 1.06x-1.18x speedups on NVIDIA A100 for GEMM problems.
Attains 2.03x speedup on MiniGPT transformer blocks over PyTorch eager baseline.
Demonstrates practical effectiveness across diverse GPU architectures and workloads.
Abstract
Deep learning compilers and vendor libraries deliver strong baseline performance but are bounded by finite, engineer-curated catalogs. When these catalogs omit a needed optimization, practitioners substitute hand-written CUDA or CUTLASS, which demands expertise in GPU microarchitecture and C++ template metaprogramming. Recent LLM-based agents target kernel generation in raw CUDA, forcing rediscovery of optimizations already encoded in mature libraries. We present FACT (Framework for Agentic CUTLASS Transpilation), a three-stage agent-driven workflow that optimizes PyTorch modules through multi-pattern composition while grounding synthesis in CUTLASS C++. Pattern discovery inspects the traced graph, matches subgraphs to optimization rules, retrieves vetted examples, and outputs prioritized patterns. Pattern realization implements each pattern as a CUTLASS kernel, verifies it, and auto-tunes it.…
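To make the pattern discovery stage more concrete, here is a minimal Python sketch of matching operator chains in a graph against a rule table and returning the matches in priority order. It is only an illustration under stated assumptions: the `Node` and `Rule` types, the rule names, the priorities, and the toy graph are all hypothetical, and it uses a hand-built graph rather than a traced PyTorch module. The abstract does not describe FACT's actual rule format or matching logic, and the sketch omits example retrieval and does not resolve overlapping matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One operator in a toy dataflow graph (hypothetical stand-in for a traced graph node)."""
    name: str
    op: str
    inputs: tuple = ()


@dataclass(frozen=True)
class Rule:
    """A hypothetical optimization rule: a linear chain of ops, producer first."""
    name: str
    ops: tuple
    priority: int  # larger value = more valuable fusion (assumed ordering)


def discover_patterns(nodes, rules):
    """Return (rule name, matched node names) pairs, highest priority first."""
    by_name = {n.name: n for n in nodes}
    users = {}
    for n in nodes:
        for producer in n.inputs:
            users.setdefault(producer, []).append(n.name)

    matches = []
    for rule in rules:
        for start in nodes:
            if start.op != rule.ops[0]:
                continue
            chain = [start]
            for wanted_op in rule.ops[1:]:
                consumers = users.get(chain[-1].name, [])
                # Require exactly one consumer so the intermediate value
                # is not needed elsewhere and the chain can be fused.
                if len(consumers) != 1 or by_name[consumers[0]].op != wanted_op:
                    break
                chain.append(by_name[consumers[0]])
            else:
                matches.append((rule.priority, rule.name, tuple(n.name for n in chain)))

    matches.sort(key=lambda m: -m[0])  # stable sort keeps discovery order within a priority
    return [(name, names) for _, name, names in matches]


# Toy graph: x -> matmul -> add -> gelu -> matmul -> add
GRAPH = [
    Node("x", "input"),
    Node("mm1", "matmul", ("x",)),
    Node("bias1", "add", ("mm1",)),
    Node("act", "gelu", ("bias1",)),
    Node("mm2", "matmul", ("act",)),
    Node("bias2", "add", ("mm2",)),
]

RULES = [
    Rule("gemm_bias", ("matmul", "add"), priority=1),
    Rule("gemm_bias_gelu", ("matmul", "add", "gelu"), priority=2),
]

patterns = discover_patterns(GRAPH, RULES)
for rule_name, node_names in patterns:
    print(rule_name, node_names)
```

In this toy graph the three-op chain is reported first because its assumed priority is higher. A later stage would still need to choose among overlapping candidates, such as the two matches that both start at `mm1`.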
