GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors

Chengming Zhang; Xinheng Ding; Baixi Sun; Xiaodong Yu; Weijian Zheng; Zhen Xie; Dingwen Tao

arXiv:2412.19829·cs.AR·December 31, 2024

Open Access

TL;DR

GFormer is a novel approach that optimizes Transformer-based large language models on Gaudi processors by integrating sparse and linear attention mechanisms, significantly enhancing efficiency and performance.

Contribution

The paper introduces GFormer, an integrated method that combines sparse and linear attention to make full use of Gaudi hardware for LLM inference, addressing gaps left by prior optimizations.

Findings

01. GFormer achieves higher efficiency on Gaudi processors.

02. It outperforms state-of-the-art GPUs in LLM tasks.

03. It yields significant improvements in model performance and hardware utilization.

Abstract

Heterogeneous hardware such as the Gaudi processor has been developed to enhance computations, especially matrix operations, for Transformer-based large language models (LLMs) in generative AI tasks. However, our analysis indicates that Transformers are not fully optimized on such emerging hardware, primarily due to inadequate optimizations in non-matrix computational kernels like Softmax and in heterogeneous resource utilization, particularly when processing long sequences. To address these issues, we propose an integrated approach (called GFormer) that merges sparse and linear attention mechanisms. GFormer aims to maximize the computational capabilities of the Gaudi processor's Matrix Multiplication Engine (MME) and Tensor Processing Cores (TPC) without compromising model quality. GFormer includes a windowed self-attention kernel and an efficient outer product kernel for causal linear…
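To make the two attention families named in the abstract concrete, the following is a minimal pure-Python reference sketch of causal sliding-window softmax attention and of causal linear attention computed with a running sum of outer products. It is not the paper's Gaudi kernel code: the function names, the `window` parameter, and the `elu(x) + 1` feature map are assumptions chosen for illustration, and the abstract excerpt does not say how GFormer combines the two mechanisms or maps them onto the MME and TPC.

```python
import math

def windowed_attention(q, k, v, window):
    """Causal sliding-window softmax attention (illustrative, not GFormer's kernel).
    Token i attends only to tokens max(0, i - window + 1) .. i."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        lo = max(0, i - window + 1)
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(lo, i + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[t] * v[lo + t][c] for t in range(len(w))) / z
                    for c in range(len(v[0]))])
    return out

def feature_map(x):
    """elu(x) + 1: a positive feature map, assumed here for illustration."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def causal_linear_attention(q, k, v):
    """Causal linear attention via running outer products:
    S_i = sum_{j<=i} phi(k_j) v_j^T and z_i = sum_{j<=i} phi(k_j)."""
    d, dv = len(q[0]), len(v[0])
    S = [[0.0] * dv for _ in range(d)]  # d x dv running state
    z = [0.0] * d                       # running normalizer
    out = []
    for qi, ki, vi in zip(q, k, v):
        fq, fk = feature_map(qi), feature_map(ki)
        for a in range(d):
            z[a] += fk[a]
            for c in range(dv):
                S[a][c] += fk[a] * vi[c]  # outer-product update phi(k_i) v_i^T
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / denom
                    for c in range(dv)])
    return out
```

In this sketch the windowed variant costs work proportional to sequence length times window size and still evaluates a softmax per token, while the linear variant carries only a fixed-size state across tokens and needs no softmax, which is why the two are natural candidates for long sequences.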

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices

Methods: Attention Is All You Need · Softmax