Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Wenxiang Lin; Xinglin Pan; Shaohuai Shi; Xuan Wang; Xiaowen Chu

arXiv:2411.15715·cs.CE·November 26, 2024

Open Access

TL;DR

This paper presents Task, a novel inference engine that accelerates large language model inference on moderate GPUs by optimizing model partitioning, task scheduling, and communication, achieving significant speedups without accuracy loss.

Contribution

Introduces Task, a high-performance inference engine that combines model partitioning, an adaptive partition algorithm, and a token assignment strategy for efficient LLM inference on moderate GPU systems.

Findings

01 Achieves 1.11x to 1.80x faster decoding speeds.

02 Attains 1.69x to 6.33x faster pre-filling speeds.

03 Overall speedup of 1.25x to 2.04x over existing solutions.

Abstract

Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can shrink model sizes but often impair accuracy, making them unsuitable for practical applications. In this work, we introduce \modelname{}, a high-performance inference engine designed to speed up LLM inference without compromising model accuracy. \modelname{} incorporates three innovative methods to increase inference efficiency: 1) model partitioning to allow asynchronous processing of tasks across CPU computation, GPU computation, and CPU-GPU communication, 2) an adaptive partition algorithm to optimize the use of CPU, GPU, and PCIe communication capabilities, and 3) a token assignment strategy to handle diverse prompt and generation tasks during LLM…
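
To make the partitioning idea concrete, here is a minimal sketch of how a layer split might be chosen when CPU computation, GPU computation, and PCIe transfers can overlap. It is not the paper's algorithm: the `CostModel` fields, the per-layer cost numbers, and the assumption that the slowest of the three resources sets the step time are all hypothetical simplifications for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical per-layer costs in milliseconds (illustrative, not measured)."""
    gpu_compute_ms: float      # GPU time to run one layer
    cpu_compute_ms: float      # CPU time to run one layer
    pcie_transfer_ms: float    # time to copy one layer's weights to the GPU
    gpu_resident_layers: int   # layers whose weights stay in GPU memory


def step_time(cost: CostModel, n_layers: int, n_cpu: int) -> tuple[float, int]:
    """Estimate one forward pass, assuming the three resources overlap fully.

    Returns (estimated time, number of layers streamed over PCIe).
    """
    n_gpu = n_layers - n_cpu
    # GPU layers beyond the resident budget must be streamed from host memory.
    n_streamed = max(0, n_gpu - cost.gpu_resident_layers)
    cpu = n_cpu * cost.cpu_compute_ms
    gpu = n_gpu * cost.gpu_compute_ms
    pcie = n_streamed * cost.pcie_transfer_ms
    # With asynchronous execution, the busiest resource is the bottleneck.
    return max(cpu, gpu, pcie), n_streamed


def best_partition(cost: CostModel, n_layers: int) -> tuple[float, int, int]:
    """Exhaustively pick the CPU layer count that minimizes the estimate.

    Returns (estimated time, layers on CPU, layers streamed over PCIe).
    """
    best = None
    for n_cpu in range(n_layers + 1):
        t, n_streamed = step_time(cost, n_layers, n_cpu)
        if best is None or t < best[0]:
            best = (t, n_cpu, n_streamed)
    return best


if __name__ == "__main__":
    cost = CostModel(gpu_compute_ms=1.0, cpu_compute_ms=8.0,
                     pcie_transfer_ms=4.0, gpu_resident_layers=8)
    print(best_partition(cost, n_layers=32))
```

With these made-up costs, running every layer on the GPU leaves PCIe as the bottleneck (24 streamed layers), while moving 8 layers to the CPU balances CPU and PCIe load. A real engine would measure the costs and account for imperfect overlap, batch size, and the prefill versus decoding phases.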

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques