Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution

Yechen Xu; Xinhao Kong; Tingjun Chen; Danyang Zhuo

arXiv:2406.00059·cs.CL·June 6, 2024

TL;DR

Conveyor is a system that enhances large language model serving efficiency by enabling partial execution of external tools during decoding, significantly reducing request latency.

Contribution

We introduce Conveyor, a novel system that allows tool partial execution in LLM serving, optimizing performance for tool-involving requests.

Findings

01

Request latency reduced by up to 38.8%

02

New interface for tool partial execution exposed to developers

03

Efficient handling of external tool invocation in LLM serving

Abstract

The complexity of large language model (LLM) serving workloads has substantially increased due to the integration with external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can improve request completion latency by up to 38.8%.

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The concept of separating text generation from tool execution and running them in parallel is interesting. The background introduction to the key concepts in LLM serving and the tool execution workflow is correct. The paper involves some engineering effort in prompt design.

Weaknesses

The paper makes a strong and somewhat unrealistic assumption. Based on the illustrative examples (Figure 4), theoretical analysis (Section 3.4), and evaluation (Section 4), it seems the authors implicitly assume that each request triggers only one tool execution and does so only once. This oversimplification deviates significantly from real-world workloads. In Section 3.4, the authors provide theoretical lower and upper bounds for their proposed parallel scheduling approach. However, these boun

Reviewer 02 · Rating 3 · Confidence 4

Strengths

- The paper tackles challenges associated with augmented LLMs, advancing the development of compound AI systems.
- The paper provides a comprehensive breakdown of the workflow for LLMs with external tool augmentation, thoroughly explaining each design component.
- Evaluation covers diverse workloads (code generation, search, planning, and validation), demonstrating Conveyor’s performance across various scenarios.

Weaknesses

- The impact of the contribution is limited by its reliance on specific types of external tool calls and workload characteristics. The optimization benefits only long, independent tool calls, raising questions about its broad applicability. Additionally, the paper does not rigorously analyze the potential decoding overhead.
- Conveyor could potentially increase latency in cases where its overhead outweighs the benefits. Presenting these cases would add value, and a hybrid approach that dynamica

Reviewer 03 · Rating 3 · Confidence 4

Strengths

(1) The writing of the paper is good. (2) This paper proposes a method addressing a problem for which satisfactory solutions are currently lacking and offers a reference for future research.

Weaknesses

(1) The related work is insufficient and does not demonstrate the advantages and differences of this work over prior studies. In the related work section (L94), the paper lacks an introduction to studies where researchers recognize methods for improving the efficiency of LLM external tool utilization such as LLM-dCache[1] and APIServe[2].
(2) The author’s approach lacks innovation and appears rather straightforward. Moreover, the effectiveness of this method may be highly dependent on the specif

Reviewer 04 · Rating 5 · Confidence 4

Strengths

The paper touches on a very timely and important matter as inference optimization becomes increasingly important with broader adoption. The paper has the following strengths:
- The experimental pipelines are well-chosen as I think they represent a good range of practical use cases.
- The theoretical framework is intuitive.

Weaknesses

Score-relevant weaknesses:
- Are the partial execution triggers learned or rule-based? Things like a newline are straightforward, but what about specific details like code delimiters that vary across programming languages? I understand it has to be passed with a tool, but isn't it impractical to define potentially 100s or 1000s of triggers? Wouldn't learning be more appropriate, especially since you already have the tokens available? I would appreciate more details and a more thorough evaluatio

Code & Models

Repositories

conveyor-sys/conveyor · PyTorch · Official

Taxonomy

Topics: Distributed and Parallel Computing Systems · Modular Robots and Swarm Intelligence · Industrial Automation and Control Systems