Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows
Md. Monzurul Amin Ifath, Israat Haque

TL;DR
This paper systematically analyzes the performance and energy trade-offs in multi-request large language model workflows, revealing key factors like batch size and power capping that influence efficiency.
Contribution
It introduces the first comprehensive characterization of energy-performance trade-offs in multi-request LLM inference, using representative workloads and real hardware measurements.
Findings
Batch size significantly affects latency and energy efficiency, and the size of the effect depends on the workload.
GPU power capping yields modest energy savings with predictable effects.
Engine-level optimizations and workflow-aware scheduling improve efficiency under constraints.
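As a rough illustration of how the energy side of these trade-offs can be quantified, the sketch below integrates sampled GPU power over time and normalizes by generated tokens. It is not taken from the paper: the function names, the trapezoidal integration, and the sample values are all assumptions made for illustration, and the paper's actual measurement method may differ.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Approximate energy (J) from power samples (W) via the trapezoidal rule.

    Hypothetical helper: assumes timestamps are in seconds and sorted ascending.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average power over the interval times its duration.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_per_token(energy_j: float, tokens: int) -> float:
    """Energy cost per generated token (J/token); hypothetical metric helper."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return energy_j / tokens


# Made-up samples for one workflow run: three power readings, one second apart.
run_energy = energy_joules([0.0, 1.0, 2.0], [100.0, 200.0, 100.0])
run_cost = energy_per_token(run_energy, tokens=600)
```

Comparing such a per-token figure across batch sizes or power caps, alongside end-to-end workflow latency, is one simple way to express the trade-off the findings describe.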
Abstract
Large language models (LLMs) are increasingly used in applications forming multi-request workflows like document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored. To address these gaps, this paper presents the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. We develop four representative workloads capturing sequential, interactive, agentic, and composite patterns common in modern deployments.…
