Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use
Julien Delavande, Regis Pierrard, Sasha Luccioni

TL;DR
This paper investigates how system-level design choices such as quantization, batching, and request scheduling significantly influence the energy efficiency of large language model inference, emphasizing the importance of holistic deployment strategies.
Contribution
It provides a comprehensive empirical analysis of energy and latency impacts of various system-level strategies in LLM inference on NVIDIA H100 GPUs, highlighting practical optimization insights.
Findings
Lower-precision formats only improve energy efficiency in compute-bound regimes.
Batching enhances energy efficiency, especially during decoding phases.
Structured request timing can reduce per-request energy consumption by up to 100 times.
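The batching finding can be illustrated with a toy amortization model. This is a minimal sketch and not the paper's measurement method: it assumes each decode step pays a fixed energy cost that is shared by the whole batch, plus a small cost per sequence. The function names and the constants `fixed_joules` and `per_seq_joules` are hypothetical values chosen for illustration, not measured results.

```python
def decode_step_energy(batch_size: int,
                       fixed_joules: float = 1.0,
                       per_seq_joules: float = 0.01) -> float:
    """Energy of one decode step under a toy model (hypothetical constants).

    fixed_joules: assumed cost paid once per step, whatever the batch size.
    per_seq_joules: assumed extra cost for each sequence in the batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_joules + per_seq_joules * batch_size


def energy_per_request(batch_size: int, output_tokens: int) -> float:
    """Per-request decode energy when the batch shares every step."""
    total = decode_step_energy(batch_size) * output_tokens
    return total / batch_size


if __name__ == "__main__":
    # Larger batches spread the fixed per-step cost over more requests.
    for b in (1, 8, 32):
        print(b, round(energy_per_request(b, output_tokens=100), 3))
```

Under these assumptions, per-request energy falls as the batch grows because the fixed cost is divided among more requests. How large the effect is on real hardware depends on the measured costs, which this sketch does not model.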
Abstract
Large Language Models (LLMs) are increasingly deployed in production, contributing towards shifting the burden in terms of computational resources and energy demands from training to inference. While prior work has examined the energy cost of inference per prompt or per token, we highlight how system-level design choices - such as numerical precision, batching strategy, and request scheduling - can lead to orders-of-magnitude differences in energy consumption for the same model. We perform a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration (e.g., with Hugging Face's Text Generation Inference server). Our results reveal that lower-precision formats only yield energy gains in compute-bound regimes; that batching improves energy efficiency, especially in memory-bound phases…
Peer Reviews
Decision: Submitted to ICLR 2026
The paper tells a clear story: if you want to cut inference energy for LLMs, you must reason by phase. By decomposing generation into prefill and decode and normalizing energy by input and output tokens, the authors reveal that these phases live in different regimes (compute vs memory). This framing is simple yet novel, and it immediately explains why a single knob can help in one phase and hurt in the other. It also shifts the conversation from model-only tweaks to the full serving stack, showi
1. Serving-stack breadth is limited. Results highlight TGI, but there is no controlled comparison against vLLM or TensorRT-LLM under the same traffic.
2. System scope is GPU-centric, not resource-limited. The study focuses on high-end GPUs (H100) and GPU energy, with little accounting for CPU or constrained devices.
3. Sequence-length coverage. Sequence length directly shapes which phase dominates (prefill vs decode), how much padding waste you incur, how large and active the KV-cache becomes, and
- This submission conducts a timely and interesting analysis on the increasingly important topic of quantifying the energy footprint of AI deployment, through the unique angle of incorporating system-level design aspects in the equation.
- The results are presented with appropriate breakdowns and normalisations to allow drawing clear conclusions in the examined setup.
- Many of the presented findings are insightful (with some being more expected than others), and the results presented can act a
- The main drawback of the presented methodology is that the energy consumption analysis is solely focused on GPU operations, ignoring memory transfers to/from GPU memory. This aspect often has a decisive impact on power draw and is notably affected by some of the examined angles, such as quantisation. As such, the practical applicability of the reported findings is limited, since the analysis does not seem to capture the whole picture.
- The presented analysis (e.g. on batching)
1. This paper addresses an important problem, the efficiency and energy consumption of LLM inference, which has become increasingly critical as model sizes continue to grow.
2. The authors examine this issue from multiple perspectives, including **quantization**, **batching**, and **serving strategies**, all of which are highly relevant for practical deployment and provide valuable insights for practitioners seeking to better understand and optimize energy use.
1. The contributions of this work are rather limited, as most of the findings and conclusions are already known in prior literature.
2. The experimental scope is not comprehensive. While some limitations, such as prompt and output diversity and hardware variety, are acknowledged, these aspects are crucial for a complete understanding of inference efficiency and should not be deferred. Moreover, key factors such as evaluations on larger models, studies on Mixture-of-Experts (MoE) architectures, and
Taxonomy
Topics: Big Data and Digital Economy · Machine Learning in Materials Science · Green IT and Sustainability
