Accelerating Language Model Workflows with Prompt Choreography
TJ Bai, Jason Eisner

TL;DR
Prompt Choreography is a framework that accelerates large language model workflows by maintaining a dynamic, global KV cache of previously encoded messages. This speeds up execution and supports parallel calls, and fine-tuning helps maintain accuracy.
Contribution
The paper introduces Prompt Choreography, a novel framework that efficiently manages LLM workflows through caching and reordering, reducing latency and enabling parallel processing.
Findings
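Reduces per-message latency, with 2.0--6.2× faster time-to-first-token.
Achieves end-to-end speedups of over 2.2× in some workflows dominated by redundant computation.
Fine-tuning the LLM to work with the cache helps it mimic the results it would give when re-encoding messages in a new context.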
Abstract
Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0--6.2× faster time-to-first-token) and achieves substantial end-to-end speedups (>2.2×) in some workflows dominated by redundant computation.
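The caching idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyGlobalCache`, its methods, and the string "encodings" are hypothetical stand-ins chosen for illustration. It only shows the bookkeeping, namely that each message is encoded once and later calls assemble their context from any reordered subset of cached messages. It ignores positional information and the mismatch between cached and re-encoded results that the paper addresses with fine-tuning.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGlobalCache:
    """Hypothetical stand-in for a global KV cache keyed by message id."""
    entries: dict = field(default_factory=dict)
    encode_calls: int = 0  # counts how many messages were actually encoded

    def encode(self, msg_id: str, text: str) -> None:
        """Encode a message once; later calls with the same id are cache hits."""
        if msg_id in self.entries:
            return
        self.encode_calls += 1
        # Placeholder "encoding": one entry per whitespace token.
        # A real system would store per-layer key/value tensors instead.
        self.entries[msg_id] = [f"kv({tok})" for tok in text.split()]

    def context_for(self, msg_ids: list) -> list:
        """Assemble a context from any subset of cached messages, in any order."""
        context = []
        for msg_id in msg_ids:
            context.extend(self.entries[msg_id])
        return context


cache = ToyGlobalCache()
cache.encode("task", "summarize the report")
cache.encode("draft", "first draft text")
cache.encode("critique", "too long")

# Two calls attend to different, reordered subsets without re-encoding anything.
ctx_a = cache.context_for(["critique", "task"])
ctx_b = cache.context_for(["task", "draft", "critique"])
cache.encode("task", "summarize the report")  # cache hit, no new work
```

In this sketch the saving comes from `encode_calls` staying at three no matter how many contexts are assembled. A workflow without such a cache would re-encode every message in every call.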
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
