Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors

Chengming Zhang; Baixi Sun; Xiaodong Yu; Zhen Xie; Weijian Zheng; Kamil Iskra; Pete Beckman; Dingwen Tao

arXiv:2309.16976·cs.LG·October 2, 2023

TL;DR

This paper systematically evaluates the performance of Transformer-based large language models on Habana GAUDI processors, providing insights and optimization strategies for efficient deployment on specialized AI hardware.

Contribution

The paper compares the performance of GAUDI's MME and TPC components, explores optimization strategies for Transformers, and evaluates Transformer models end to end on GAUDI.

Findings

01. GAUDI's MME and TPC have distinct strengths and weaknesses

02. Optimization strategies improve Transformer performance on GAUDI

03. End-to-end LLM performance on GAUDI reveals key bottlenecks

Abstract

Transformer models have achieved remarkable success in various machine learning tasks but suffer from high computational complexity and resource requirements. The quadratic complexity of the self-attention mechanism further exacerbates these challenges when dealing with long sequences and large datasets. Specialized AI hardware accelerators, such as the Habana GAUDI architecture, offer a promising solution to tackle these issues. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPC). This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process. Firstly, we provide a comprehensive performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Secondly, we explore strategies to optimize…
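The quadratic cost of self-attention mentioned in the abstract can be illustrated with a rough operation count. The sketch below is not taken from the paper: the function name `attention_matmul_flops` and the head dimension of 64 are illustrative assumptions, and it counts only the two matrix products of a single attention head (scores and weighted values), ignoring softmax and the linear projections.

```python
def attention_matmul_flops(seq_len: int, head_dim: int) -> int:
    """Approximate FLOPs for the two matrix products in one attention head.

    Hypothetical helper for illustration only (not from the paper).
    Scores  Q @ K^T : (n x d) @ (d x n) -> about 2 * n^2 * d FLOPs
    Output  P @ V   : (n x n) @ (n x d) -> about 2 * n^2 * d FLOPs
    """
    scores = 2 * seq_len * seq_len * head_dim
    weighted_values = 2 * seq_len * seq_len * head_dim
    return scores + weighted_values


if __name__ == "__main__":
    # head_dim=64 is an assumed, commonly used value, not a figure from the paper.
    for n in (512, 1024, 2048):
        print(n, attention_matmul_flops(n, head_dim=64))
```

Under these assumptions, doubling the sequence length quadruples the count, which is why long sequences dominate the cost on any accelerator.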

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Advanced Data Storage Technologies