Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution
Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo

TL;DR
Conveyor is an LLM serving system that lets external tools begin executing partially while the model is still decoding, reducing request completion latency by up to 38.8%.
Contribution
We introduce Conveyor, a novel system that enables tool partial execution in LLM serving, reducing latency for requests that invoke external tools.
Findings
Request latency reduced by up to 38.8%
New interface for tool partial execution exposed to developers
Efficient handling of external tool invocation in LLM serving
Abstract
The complexity of large language model (LLM) serving workloads has substantially increased due to the integration with external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can improve request completion latency by up to 38.8%.
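The idea of running a tool partially while decoding continues can be illustrated with a minimal Python sketch. It is not Conveyor's actual interface: the class `PartialExecutor`, its `feed`/`finish` methods, and the newline trigger are all hypothetical names and choices made for illustration. The sketch assumes the tool input can be split into independent lines, and it hands each completed line to a single background worker so that tool work overlaps with the arrival of later tokens.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterable, List


class PartialExecutor:
    """Hypothetical wrapper: runs each complete line of tool input as soon as
    it has been decoded, instead of waiting for the whole tool call."""

    def __init__(self, run_line: Callable[[str], str]) -> None:
        self._run_line = run_line          # assumed per-line tool entry point
        self._buffer = ""                  # tokens that do not yet form a full line
        self._pending: List[Future] = []
        # One worker keeps lines executing in the order they were decoded.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def feed(self, token: str) -> None:
        """Accept one decoded token and dispatch any lines it completes."""
        self._buffer += token
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            if line.strip():
                self._pending.append(self._pool.submit(self._run_line, line))

    def finish(self) -> List[str]:
        """Decoding is done: run the leftover text and collect all results."""
        if self._buffer.strip():
            self._pending.append(self._pool.submit(self._run_line, self._buffer))
        self._buffer = ""
        results = [future.result() for future in self._pending]
        self._pool.shutdown()
        return results


def serve(tokens: Iterable[str], run_line: Callable[[str], str]) -> List[str]:
    """Stand-in for the decode loop: forward each token to the executor."""
    executor = PartialExecutor(run_line)
    for token in tokens:
        executor.feed(token)
    return executor.finish()


if __name__ == "__main__":
    # str.upper stands in for a real tool; tokens need not align with lines.
    print(serve(["pri", "nt(1)\n", "x = ", "2\n", "y"], str.upper))
```

In this sketch only the work for the final, incomplete line remains once decoding ends, which is where a latency saving would come from. A tool whose input cannot be split this way would gain nothing from such a wrapper.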
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
The concept of separating text generation from tool execution and running them in parallel is interesting. The background introduction to the key concepts in LLM serving and the tool execution workflow is correct. The paper involves some engineering effort in prompt design.
The paper makes a strong and somewhat unrealistic assumption. Based on the illustrative examples (Figure 4), theoretical analysis (Section 3.4), and evaluation (Section 4), it seems the authors implicitly assume that each request triggers only one tool execution and does so only once. This oversimplification deviates significantly from real-world workloads. In Section 3.4, the authors provide theoretical lower and upper bounds for their proposed parallel scheduling approach. However, these boun
- The paper tackles challenges associated with augmented LLMs, advancing the development of compound AI systems.
- The paper provides a comprehensive breakdown of the workflow for LLMs with external tool augmentation, thoroughly explaining each design component.
- Evaluation covers diverse workloads (code generation, search, planning, and validation), demonstrating Conveyor's performance across various scenarios.
- The impact of the contribution is limited by its reliance on specific types of external tool calls and workload characteristics. The optimization benefits only long, independent tool calls, raising questions about its broad applicability. Additionally, the paper does not rigorously analyze the potential decoding overhead.
- Conveyor could potentially increase latency in cases where its overhead outweighs the benefits. Presenting these cases would add value, and a hybrid approach that dynamica
(1) The writing of the paper is good. (2) This paper proposes a method addressing a problem for which satisfactory solutions are currently lacking and offers a reference for future research.
(1) The related work is insufficient and does not demonstrate the advantages and differences of this work over prior studies. In the related work section (L94), the paper does not discuss prior studies on improving the efficiency of LLM external tool use, such as LLM-dCache [1] and APIServe [2]. (2) The authors' approach lacks innovation and appears rather straightforward. Moreover, the effectiveness of this method may be highly dependent on the specif
The paper touches on a very timely and important matter as inference optimization becomes increasingly important with broader adoption. The paper has the following strengths:
- The experimental pipelines are well-chosen as I think they represent a good range of practical use cases.
- The theoretical framework is intuitive.
Score-relevant weaknesses:
- Are the partial execution triggers learned or rule-based? Things like a newline are straightforward, but what about specific details like code delimiters that vary across programming languages? I understand it has to be passed with a tool, but isn't it impractical to define potentially 100s or 1000s of triggers? Wouldn't learning be more appropriate, especially since you already have the tokens available? I would appreciate more details and a more thorough evaluatio
Taxonomy
Topics: Distributed and Parallel Computing Systems · Modular Robots and Swarm Intelligence · Industrial Automation and Control Systems
