The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute

Yunho Jin; Gu-Yeon Wei; David Brooks

arXiv:2505.14733·cs.LG·November 11, 2025

TL;DR

This paper investigates whether test-time compute (TTC), which allocates additional compute during inference, can make large language models more energy efficient. It reports better accuracy-energy trade-offs than scaling model size, especially on complex reasoning tasks.

Contribution

It demonstrates that TTC can outperform traditional model scaling in energy efficiency and accuracy, and highlights the importance of adjusting compute based on query complexity.

Findings

1. TTC surpasses traditional scaling in accuracy/energy efficiency.
2. Adjusting compute based on output length improves efficiency.
3. TTC is especially beneficial for complex reasoning tasks.
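
Majority voting over several sampled answers is one widely used way to spend extra compute at inference time. The sketch below is only an illustration of that general idea, not the authors' implementation: the function name `majority_vote` and the sample answers are hypothetical, and each extra sample is assumed to cost roughly one more generation's worth of energy.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among sampled generations.

    Ties are broken in favor of the answer that was sampled first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # Counter preserves insertion order, and max() returns the first
    # maximal key it sees, so the earliest-sampled answer wins a tie.
    return max(counts, key=counts.get)


# Hypothetical final answers extracted from five sampled generations.
samples = ["42", "41", "42", "40", "42"]
voted = majority_vote(samples)
```

Drawing more samples can raise accuracy, but it also multiplies the number of generated tokens, which is why the energy side of the trade-off matters.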

Abstract

Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work explores how test-time compute (TTC) can serve as an energy-efficient complement to conventional scaling strategies by allocating additional computational resources at inference time rather than during training. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query…
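
An accuracy-energy comparison of this kind needs an energy figure per configuration. A minimal sketch, assuming power is sampled in watts at known timestamps and integrated with the trapezoidal rule, is shown here. The function names, the accuracy-per-kilojoule metric, and all numbers are illustrative assumptions and are not taken from the paper.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (watts) over time (seconds) into joules.

    Uses the trapezoidal rule between consecutive samples.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must align")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def accuracy_per_kilojoule(accuracy, energy_j):
    """Illustrative efficiency metric: accuracy divided by energy in kJ."""
    if energy_j <= 0:
        raise ValueError("energy must be positive")
    return accuracy / (energy_j / 1000.0)


# Made-up trace: a steady 100 W draw sampled once per second for 2 s.
trace_energy = energy_joules([0.0, 1.0, 2.0], [100.0, 100.0, 100.0])
# Made-up run: 50% accuracy at a cost of 2000 J.
efficiency = accuracy_per_kilojoule(0.5, 2000.0)
```

Comparing a small model with TTC against a larger model without it then amounts to computing this kind of ratio, or plotting accuracy against energy, for each configuration on the same benchmark.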

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Authors provide a timely and useful set of explorations that many in the community may find helpful as rules of thumb and guidelines for their own work.
2. I felt it was a fresh perspective, analyzing results through the lenses of GPU utilization and instantaneous power draw vs overall energy consumption.
3. Compelling discussions of some possibly counterintuitive results in 4.2 -- I actually felt these should be highlighted more.

Weaknesses

1. I realize there is probably not a single setting that is clearly both a fair comparison and very realistic, but a batch size of 16 feels optimistic -- I would rather have seen multiple batch sizes explored given that models are often served in very different settings. I cannot help but imagine that smaller models with TTC beating larger models without TTC is at least somewhat a function of batch size, and it would be extremely helpful to understand what the practical threshold is, even just f

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Good experimental design. The range of model sizes and the diverse set of methods and benchmarks make this paper's results more grounded.
2. Pretty practical and timely.
3. Pretty novel insight that reasoning is more energy efficient.

Weaknesses

1. Only studies Qwen2.5 models; it would have been good to see this across different model families.
2. A training analysis would have been nice to add, but I understand that that is expensive.
3. Did not see standard error reported here?
4. Could have picked non-reasoning benchmarks as well to understand the landscape a bit better.

Reviewer 03 · Rating 10 · Confidence 4

Strengths

The paper provides empirical evidence for several important insights, notably:
-- that merely increasing model size does not guarantee enhanced accuracy if underlying reasoning capabilities remain insufficient.
-- that majority voting consumes orders of magnitude more energy than base models, with reasoning tokens consuming even more, notably because of how many tokens are produced.
-- the fact that the RT inference process quickly becomes memory-bound due to the number of tokens produced, limit

Weaknesses

I feel that some of the analyses could be deepened further, especially with regard to potential hypotheses about why certain models or configurations use more energy than others. Also, the figures are often not well explained and appear in a different place in the article than the text that discusses them, which can make it hard to follow the narrative of the paper. Certain methodological details are lacking -- e.g. regarding model selection and setup -- which could help to

Taxonomy

Topics: Semantic Web and Ontologies · Software Engineering Research · Natural Language Processing Techniques