LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

Jaehong Cho; Minsu Kim; Hyunmin Choi; Guseul Heo; Jongse Park

arXiv:2408.05499 · cs.DC · December 2, 2024

TL;DR

LLMServingSim is a hardware-software co-simulation tool that accurately models the behavior of large language model (LLM) inference serving systems without extensively extending simulation time.

Contribution

This paper introduces LLMServingSim, a novel simulation infrastructure that accounts for workload dynamics and computation redundancies in LLM inference serving systems.

Findings

01. Achieves less than 14.7% error compared to real GPU systems

02. Offers 91.5x faster simulation speed than existing simulators

03. Supports flexible integration of various accelerator stacks

Abstract

Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts not only include innovations in the algorithm and software domains but also constitute developments of various hardware acceleration techniques. Nevertheless, there is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in LLM serving systems without extensively extending the simulation time. This paper aims to develop an effective simulation tool, called LLMServingSim, to support future research in LLM serving systems. In designing LLMServingSim, we focus on two limitations of existing simulators: (1) they lack consideration of the dynamic workload variations of LLM inference serving due to its autoregressive nature, and (2) they incur repetitive simulations without leveraging algorithmic…

Code & Models

Repositories

casys-kaist/llmservingsim (PyTorch, official)

Taxonomy

Topics: Advanced Computational Techniques and Applications
