An Inquiry into Datacenter TCO for LLM Inference with FP8
Jiwoo Kim, Joonhyung Lee, Gunho Park, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee, Youngjoo Lee

TL;DR
This paper analyzes the total cost of ownership (TCO) of LLM inference in datacenters, focusing on how low-precision FP8 computation and hardware utilization affect cost, and presents a framework for comparing AI accelerators.
Contribution
It introduces a generalizable TCO comparison framework for AI accelerators and uses it to examine how FP8 quantization and workload characteristics affect cost, supported by empirical measurements of hardware performance.
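As a rough illustration of what a TCO comparison computes, the sketch below amortizes purchase price and energy cost over a device's lifetime and divides by the tokens served. It is not the paper's model: the `Accelerator` fields, the default lifetime, electricity price, and PUE are all hypothetical placeholders chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    """Hypothetical device description; none of these values come from the paper."""
    name: str
    capex_usd: float        # purchase price per device
    power_kw: float         # average power draw under load
    tokens_per_sec: float   # measured serving throughput for the workload


def cost_per_million_tokens(acc: Accelerator,
                            lifetime_years: float = 4.0,
                            usd_per_kwh: float = 0.10,
                            pue: float = 1.3) -> float:
    """Amortized capex plus energy cost, per one million generated tokens.

    Assumes constant utilization over the whole lifetime (a simplification).
    """
    hours = lifetime_years * 365 * 24
    # PUE scales device power up to facility power (cooling, power delivery).
    energy_cost = acc.power_kw * pue * hours * usd_per_kwh
    total_tokens = acc.tokens_per_sec * hours * 3600
    return (acc.capex_usd + energy_cost) / total_tokens * 1e6


# Two made-up devices: B costs more up front but serves more tokens per second.
a = Accelerator("device-A", capex_usd=20_000, power_kw=0.6, tokens_per_sec=2_000)
b = Accelerator("device-B", capex_usd=30_000, power_kw=0.7, tokens_per_sec=3_500)
for acc in (a, b):
    print(acc.name, round(cost_per_million_tokens(acc), 4))
```

Because throughput sits in the denominator, measured tokens per second on the actual workload moves the result as much as price or power does, which is why the summary stresses utilization rather than peak specifications.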
Findings
Gaudi HPUs outperform others on thin GEMMs in FP8 models
Memory-bound GEMV computations influence TCO more than hardware peak throughput
Empirical workload analysis is crucial for evaluating accelerator performance
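The point about memory-bound computation can be made concrete with a standard roofline estimate. The sketch below is an illustration, not the paper's methodology: it assumes FP8 inputs (1 byte each), 2-byte outputs, each operand moved exactly once, and a hypothetical device with 1000 TFLOP/s of peak FP8 compute and 3 TB/s of memory bandwidth. Those hardware numbers are placeholders, not the specifications of any Gaudi or NVIDIA part.

```python
def gemm_arithmetic_intensity(m: int, k: int, n: int,
                              bytes_in: int = 1, bytes_out: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) GEMM.

    Assumes both inputs are read once and the output is written once.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_in * (m * k + k * n) + bytes_out * m * n
    return flops / bytes_moved


def attainable_tflops(m: int, k: int, n: int,
                      peak_tflops: float, mem_bw_tbps: float) -> float:
    """Roofline bound: the lower of peak compute and intensity x bandwidth."""
    ai = gemm_arithmetic_intensity(m, k, n)
    return min(peak_tflops, ai * mem_bw_tbps)


PEAK_TFLOPS = 1000.0  # hypothetical peak FP8 throughput
MEM_BW_TBPS = 3.0     # hypothetical memory bandwidth in TB/s

for m in (1, 16, 256, 4096):  # m = 1 is the GEMV case seen in single-sequence decoding
    bound = attainable_tflops(m, 8192, 8192, PEAK_TFLOPS, MEM_BW_TBPS)
    print(f"m={m:5d}  attainable ~ {bound:8.1f} TFLOP/s")
```

Under these assumptions the m = 1 case is capped at a few TFLOP/s regardless of the peak figure, so for thin shapes memory bandwidth and achieved utilization set the throughput that enters a TCO calculation.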
Abstract
As large language models (LLMs) continue to scale, the high power consumption of AI accelerators in datacenters presents significant challenges, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs) that provide LLM inference. In this work, we analyze the computational characteristics of LLM inference from a TCO perspective and present a generalizable framework to compare AI accelerators across diverse operational requirements. Using this model, we investigate key workload characteristics influencing TCO for AI accelerators from Intel (Gaudi 2 & 3) and NVIDIA (H100 & H200), especially thin GEMM utilization and FP8 quantization. In particular, as FP8 emerges as the baseline precision for next-generation LLMs, understanding how different architectures implement and benefit from low-precision computation is increasingly critical. Throughput on thin…
