Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors
Chengming Zhang, Baixi Sun, Xiaodong Yu, Zhen Xie, Weijian Zheng, Kamil Iskra, Pete Beckman, Dingwen Tao

TL;DR
This paper systematically evaluates the performance of Transformer-based large language models on Habana GAUDI processors, providing insights and optimization strategies for efficient deployment on specialized AI hardware.
Contribution
It offers a comprehensive performance comparison, optimization strategies, and end-to-end evaluation of Transformers on GAUDI, addressing a research gap in hardware acceleration.
Findings
GAUDI's MME and TPC have distinct strengths and weaknesses
Optimization strategies improve Transformer performance on GAUDI
End-to-end LLM performance on GAUDI reveals key bottlenecks
Abstract
Transformer models have achieved remarkable success in various machine learning tasks but suffer from high computational complexity and resource requirements. The quadratic complexity of the self-attention mechanism further exacerbates these challenges when dealing with long sequences and large datasets. Specialized AI hardware accelerators, such as the Habana GAUDI architecture, offer a promising solution to tackle these issues. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPC). This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process. Firstly, we provide a comprehensive performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Secondly, we explore strategies to optimize…
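The quadratic cost mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function name `attention_score_cost`, its parameters, and the default of 2 bytes per element (a 16-bit data type) are assumptions. It counts only the multiply-adds for the QK^T score matrix and the memory needed to hold that matrix, ignoring softmax, the value projection, and any hardware-specific behavior.

```python
def attention_score_cost(seq_len: int, head_dim: int,
                         num_heads: int = 1, bytes_per_elem: int = 2):
    """Rough cost of the QK^T score matrix for one attention layer.

    Hypothetical helper for illustration; not an API from the paper
    or from any Habana library.
    """
    # QK^T multiplies a (seq_len x head_dim) matrix by a (head_dim x seq_len)
    # matrix: seq_len^2 outputs, each needing head_dim multiply-adds (2 FLOPs).
    flops = 2 * seq_len * seq_len * head_dim * num_heads
    # The score matrix itself holds seq_len x seq_len elements per head.
    score_bytes = seq_len * seq_len * num_heads * bytes_per_elem
    return flops, score_bytes


if __name__ == "__main__":
    # Assumed example shapes: 12 heads of dimension 64.
    for n in (512, 1024, 2048):
        flops, mem = attention_score_cost(n, head_dim=64, num_heads=12)
        print(f"seq_len={n}: {flops / 1e9:.2f} GFLOPs, {mem / 2**20:.1f} MiB of scores")
```

Doubling the sequence length quadruples both numbers, which is why long sequences put so much pressure on compute and memory.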
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Advanced Data Storage Technologies
