An Inquiry into Datacenter TCO for LLM Inference with FP8

Jiwoo Kim; Joonhyung Lee; Gunho Park; Byeongwook Kim; Se Jung Kwon; Dongsoo Lee; Youngjoo Lee

arXiv:2502.01070 · cs.LG · August 26, 2025

TL;DR

This paper analyzes the total cost of ownership (TCO) of LLM inference in datacenters, emphasizing the impact of low-precision FP8 computation and hardware utilization, and provides a framework for comparing AI accelerators.

Contribution

It introduces a generalizable TCO comparison framework for AI accelerators, focusing on FP8 quantization and workload characteristics, with empirical insights into hardware performance.
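
To make the idea of a TCO comparison concrete, the following is a minimal sketch of a cost-per-token calculation. It is not the paper's framework: the cost model (amortized purchase price plus energy cost, divided by served tokens), the `Accelerator` fields, and all default parameters (`lifetime_years`, `usd_per_kwh`, `pue`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class Accelerator:
    """Hypothetical per-device inputs; none of these values come from the paper."""
    name: str
    capex_usd: float          # purchase price per device
    power_kw: float           # average power draw under load, in kW
    tokens_per_second: float  # measured serving throughput for the workload


def cost_per_million_tokens(acc, lifetime_years=4.0, usd_per_kwh=0.10, pue=1.3):
    """Return an illustrative USD cost per one million generated tokens.

    Assumed model: hourly cost = amortized capex + energy cost, where energy
    is scaled by the datacenter's power usage effectiveness (PUE).
    """
    capex_per_hour = acc.capex_usd / (lifetime_years * HOURS_PER_YEAR)
    energy_per_hour = acc.power_kw * pue * usd_per_kwh
    tokens_per_hour = acc.tokens_per_second * 3600
    return (capex_per_hour + energy_per_hour) / tokens_per_hour * 1e6


# Compare two hypothetical devices on the same workload.
device_a = Accelerator("device-a", capex_usd=30_000, power_kw=0.7, tokens_per_second=2_000)
device_b = Accelerator("device-b", capex_usd=20_000, power_kw=0.6, tokens_per_second=1_500)
ratio = cost_per_million_tokens(device_a) / cost_per_million_tokens(device_b)
print(f"cost ratio (a / b): {ratio:.3f}")
```

Because throughput sits in the denominator, a device with lower peak specifications can still come out ahead under this model if it sustains higher measured throughput on the actual workload.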

Findings

01. Gaudi HPUs outperform the NVIDIA GPUs evaluated (H100, H200) on thin GEMMs in FP8 models

02. Memory-bound GEMV computations influence TCO more than hardware peak throughput

03. Empirical workload analysis is crucial for evaluating accelerator performance
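
The gap between peak throughput and what thin GEMMs or GEMVs can actually reach can be illustrated with a standard roofline estimate. The sketch below is a textbook approximation, not the paper's measurement method: it assumes each operand and the output are moved exactly once, that all three use the same element size (1 byte for FP8), and the peak and bandwidth figures in the usage lines are hypothetical rather than vendor specifications.

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_element=1):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes A, B, and C are each transferred once and share one element size.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_tflops(m, k, n, peak_tflops, mem_bw_tbps, bytes_per_element=1):
    """Roofline bound: the lower of the compute peak and bandwidth * intensity.

    FLOP/byte multiplied by TB/s gives TFLOP/s, so the units are consistent.
    """
    intensity = gemm_arithmetic_intensity(m, k, n, bytes_per_element)
    return min(peak_tflops, intensity * mem_bw_tbps)


# Hypothetical device: 1000 TFLOP/s FP8 peak, 3 TB/s memory bandwidth.
for m in (1, 16, 256, 4096):
    bound = attainable_tflops(m, 4096, 4096, peak_tflops=1000.0, mem_bw_tbps=3.0)
    print(f"m={m:5d}  roofline bound = {bound:8.1f} TFLOP/s")
```

With `m = 1` the product is a GEMV, its intensity is close to 2 FLOPs per byte, and the bound is set almost entirely by memory bandwidth, which is why peak throughput alone says little about decode-heavy inference cost.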

Abstract

As large language models (LLMs) continue to scale, the high power consumption of AI accelerators in datacenters presents significant challenges, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs) that provide LLM inference. In this work, we analyze the computational characteristics of LLM inference from a TCO perspective and present a generalizable framework to compare AI accelerators across diverse operational requirements. Using this model, we investigate key workload characteristics influencing TCO for AI accelerators from Intel (Gaudi 2 & 3) and NVIDIA (H100 & H200), especially thin GEMM utilization and FP8 quantization. In particular, as FP8 emerges as the baseline precision for next-generation LLMs, understanding how different architectures implement and benefit from low-precision computation is increasingly critical. Throughput on thin…
