AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

Ilias Bournias; Lukas Cavigelli; Georgios Zacharopoulos

arXiv:2411.05555 · cs.DC · November 11, 2024

Open Access

TL;DR

AcceLLM introduces a redundancy-based approach to large language model inference that balances load and improves data locality, reducing latency and increasing efficiency on cloud AI accelerators.

Contribution

It presents a novel method leveraging data redundancy to enhance load balancing and data locality for LLM inference, addressing latency and resource utilization issues.

Findings

01

Up to 30% improvement in latency and efficiency over state-of-the-art systems.

02

Effective handling of diverse workloads with improved hardware utilization.

03

Evaluated in simulation on Nvidia H100 GPU and Huawei Ascend 910B2.

Abstract

Large Language Model (LLM) inference on large-scale systems is expected to dominate future cloud infrastructures. Efficient LLM inference in cloud environments with numerous AI accelerators is challenging and requires extensive optimization to perform well. Current systems either batch prefill and decoding together to boost throughput, which causes latency issues, or disaggregate the two phases, which leads to resource underutilization. We propose AcceLLM, a novel method that addresses latency and load balancing, inspired by cache data management. It strategically uses redundant data to enhance inference through load balancing and optimal hardware use. Simulated evaluations on Nvidia H100 GPU and Huawei Ascend 910B2 show that AcceLLM surpasses state-of-the-art systems by up to 30% in latency and efficiency while handling diverse workloads effectively.
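To make the idea of redundancy-based load balancing concrete, here is a minimal sketch. It is not the AcceLLM algorithm, which the abstract does not specify. It only illustrates the general principle that when a request's data is held on more than one instance, a router can send work to the least-loaded holder without moving that data. The names `Instance`, `place`, `route`, and the `copies` parameter are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One accelerator instance (hypothetical model); `load` counts assigned requests."""
    name: str
    load: int = 0
    cached: set = field(default_factory=set)  # ids of requests whose data is held here


def place(request_id, instances, copies=2):
    """Store a request's data redundantly on the `copies` least-loaded instances."""
    targets = sorted(instances, key=lambda inst: inst.load)[:copies]
    for inst in targets:
        inst.cached.add(request_id)
    return targets


def route(request_id, instances):
    """Pick the least-loaded instance that already holds the request's data."""
    holders = [inst for inst in instances if request_id in inst.cached]
    if not holders:
        raise KeyError(f"no instance holds data for {request_id!r}")
    chosen = min(holders, key=lambda inst: inst.load)
    chosen.load += 1  # the chosen instance takes on the work
    return chosen


# Usage: replicas land on "a" and "b"; once "a" is busy, work goes to "b".
pool = [Instance("a"), Instance("b"), Instance("c")]
place("r1", pool)
pool[0].load = 5
print(route("r1", pool).name)  # b
```

With a single copy the router would have no choice but the busy instance; the second copy is what lets it balance load while keeping the work next to its data.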

Taxonomy

Topics: Distributed and Parallel Computing Systems · Power Systems and Technologies · Advanced Data Storage Technologies