GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors
Chengming Zhang, Xinheng Ding, Baixi Sun, Xiaodong Yu, Weijian Zheng, Zhen Xie, Dingwen Tao

TL;DR
GFormer is a novel approach that optimizes Transformer-based large language models on Gaudi processors by integrating sparse and linear attention mechanisms, significantly enhancing efficiency and performance.
Contribution
It introduces GFormer, an integrated method combining sparse and linear attention to fully utilize Gaudi hardware for LLM inference, addressing prior optimization gaps.
Findings
GFormer achieves higher efficiency on Gaudi processors.
It outperforms state-of-the-art GPUs in LLM tasks.
It significantly improves model performance and hardware utilization.
Abstract
Heterogeneous hardware like the Gaudi processor has been developed to enhance computations, especially matrix operations, for Transformer-based large language models (LLMs) in generative AI tasks. However, our analysis indicates that Transformers are not fully optimized on such emerging hardware, primarily due to inadequate optimizations in non-matrix computational kernels like Softmax and in heterogeneous resource utilization, particularly when processing long sequences. To address these issues, we propose an integrated approach (called GFormer) that merges sparse and linear attention mechanisms. GFormer aims to maximize the computational capabilities of the Gaudi processor's Matrix Multiplication Engine (MME) and Tensor Processing Cores (TPC) without compromising model quality. GFormer includes a windowed self-attention kernel and an efficient outer product kernel for causal linear…
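The two attention families named in the abstract can be illustrated with a minimal pure-Python sketch. It is not GFormer's Gaudi kernels, and it does not show how the paper combines the two mechanisms, which the excerpt above does not specify. The function names, the causal sliding-window definition, and the feature map phi(x) = elu(x) + 1 are all assumptions chosen for illustration.

```python
import math

def elu_plus_one(x):
    # Assumed positive feature map phi(x) = elu(x) + 1 (a common linear-attention choice).
    return x + 1.0 if x > 0 else math.exp(x)

def causal_linear_attention(Q, K, V):
    """Causal linear attention via running outer-product sums.

    Q, K: lists of n vectors of length d; V: list of n vectors of length m.
    Cost is O(n * d * m) instead of O(n^2) in sequence length.
    """
    d, m = len(Q[0]), len(V[0])
    S = [[0.0] * m for _ in range(d)]  # running sum of outer products phi(k_j) v_j^T
    z = [0.0] * d                      # running sum of phi(k_j), used for normalization
    out = []
    for q, k, v in zip(Q, K, V):
        fq = [elu_plus_one(x) for x in q]
        fk = [elu_plus_one(x) for x in k]
        for a in range(d):
            z[a] += fk[a]
            for b in range(m):
                S[a][b] += fk[a] * v[b]  # outer-product update for position j
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(m)])
    return out

def windowed_attention(Q, K, V, window):
    """Causal sliding-window softmax attention (hypothetical reference version).

    Position i attends only to positions max(0, i - window + 1) .. i.
    """
    n, d, m = len(Q), len(Q[0]), len(V[0])
    out = []
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = [sum(Q[i][a] * K[j][a] for a in range(d)) / math.sqrt(d)
                  for j in range(lo, i + 1)]
        mx = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - mx) for s in scores]
        total = sum(w)
        out.append([sum(w[t] * V[lo + t][b] for t in range(len(w))) / total
                    for b in range(m)])
    return out
```

In this sketch the windowed variant still needs a Softmax per row but only over `window` keys, while the linear variant replaces Softmax entirely with accumulated outer products, which is why the two are natural candidates for long sequences.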
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices
Methods: Attention Is All You Need · Softmax
