An LLM-Tool Compiler for Fused Parallel Function Calling

Simranjit Singh; Andreas Karatzas; Michael Fore; Iraklis Anagnostopoulos; Dimitrios Stamoulis

arXiv:2405.17438 · cs.PL · May 29, 2024

TL;DR

This paper introduces LLM-Tool Compiler, a method that fuses similar tool operations to improve parallelization in large language models, significantly reducing latency and costs in complex API call tasks.

Contribution

It presents a novel compiler-based approach inspired by hardware principles to selectively fuse tool operations, enhancing parallelism and efficiency in LLM tool calling.

Findings

1. Achieves up to four times more parallel calls.
2. Reduces token costs by up to 40%.
3. Lowers latency by up to 12%.

Abstract

State-of-the-art sequential reasoning in Large Language Models (LLMs) has expanded the capabilities of Copilots beyond conversational tasks to complex function calling, managing thousands of API calls. However, the tendency of compositional prompting to segment tasks into multiple steps, each requiring a round-trip to the GPT APIs, leads to increased system latency and costs. Although recent advancements in parallel function calling have improved tool execution per API call, they may necessitate more detailed in-context instructions and task breakdown at the prompt level, resulting in higher engineering and production costs. Inspired by the hardware design principles of multiply-add (MAD) operations, which fuse multiple arithmetic operations into a single task from the compiler's perspective, we propose LLM-Tool Compiler, which selectively fuses similar types of tool operations under a…
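
The fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_similar_calls`, the plan format, and the tool names `load_images` and `detect_objects` are all hypothetical, and the sketch assumes that calls to the same tool are independent of one another, so grouping them cannot violate a data dependency.

```python
def fuse_similar_calls(plan):
    """Group calls to the same tool into one fused operation.

    `plan` is a list of (tool_name, args) pairs (hypothetical format).
    Fused groups keep the order in which each tool first appears.
    Assumes calls to the same tool do not depend on each other.
    """
    fused = {}  # dicts preserve insertion order in Python 3.7+
    for tool, args in plan:
        fused.setdefault(tool, []).append(args)
    return list(fused.items())


# A step-by-step plan: four separate tool operations.
plan = [
    ("load_images", {"region": "A"}),
    ("detect_objects", {"region": "A"}),
    ("load_images", {"region": "B"}),
    ("detect_objects", {"region": "B"}),
]

# After fusion: two operations, each carrying a batch of arguments
# that could be issued together as parallel calls.
fused_plan = fuse_similar_calls(plan)
for tool, arg_batch in fused_plan:
    print(tool, arg_batch)
```

In this toy plan, four sequential operations collapse into two fused ones, which is the kind of reduction in round-trips the abstract attributes to fusing similar tool operations. A real system would also need to check dependencies between calls before grouping them.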

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Softmax · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer · Linear Warmup With Cosine Annealing