EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
Eliseo Curcio

TL;DR
EnergyAgentBench is a benchmark that uses live energy data to evaluate LLM agents on multi-step energy infrastructure tasks, allowing a realistic assessment of their reasoning and decision-making.
Contribution
This paper introduces the first agentic benchmark grounded in live electricity market data, comprising 70 task variants across five families, and evaluates multiple LLMs on it.
Findings
Claude Sonnet 4.6 achieves the highest overall score (0.900) at lower cost.
Claude Haiku 4.5 excels at long-horizon procedural siting (0.986).
The F3 causal grid diagnosis family effectively discriminates between models, with a 30.7-point score spread.
Abstract
Selecting the right electricity market region for a hyperscale AI datacenter requires reasoning across live electricity prices, grid carbon intensity, technology cost trajectories, and causal grid dynamics -- a multi-step, multi-source analytical task that static knowledge benchmarks cannot evaluate. We introduce EnergyAgentBench, the first agentic benchmark grounded in live electricity market data for this problem class. The benchmark comprises 70 task variants across five families: datacenter siting under cost-carbon trade-offs (F1), long-horizon portfolio siting (F1-LH), lifetime LCOE ranking over multi-decade cost trajectories (F2), 30-year portfolio optimization (F2-LH), and causal grid diagnosis (F3). Tasks require 3 to 48 sequential tool calls against live endpoints from the QuarluxAI infrastructure platform, the U.S. Energy Information Administration (EIA), and the National…
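To make the F1 cost-carbon trade-off concrete, here is a minimal sketch of one way an agent might rank candidate regions once it has fetched a price and a carbon-intensity reading for each. It is not the benchmark's scoring method: the `RegionSnapshot` type, the `rank_regions` function, the min-max normalization, the linear weighting, and the placeholder region names and numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionSnapshot:
    """Hypothetical container for one region's fetched readings."""
    name: str
    price_usd_per_mwh: float
    carbon_g_per_kwh: float


def _min_max(values):
    """Scale values to [0, 1]; return all zeros if every value is equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_regions(snapshots, carbon_weight=0.5):
    """Return region names ordered best-first (lowest combined score).

    carbon_weight=0 ranks on price alone; carbon_weight=1 on carbon alone.
    The linear blend is an illustrative assumption, not the paper's metric.
    """
    if not 0.0 <= carbon_weight <= 1.0:
        raise ValueError("carbon_weight must be between 0 and 1")
    prices = _min_max([s.price_usd_per_mwh for s in snapshots])
    carbons = _min_max([s.carbon_g_per_kwh for s in snapshots])
    scored = [
        ((1.0 - carbon_weight) * p + carbon_weight * c, s.name)
        for s, p, c in zip(snapshots, prices, carbons)
    ]
    # Ties on score fall back to alphabetical order of the region name.
    return [name for _, name in sorted(scored)]


# Placeholder readings, invented purely for the example.
candidates = [
    RegionSnapshot("region_a", price_usd_per_mwh=30.0, carbon_g_per_kwh=500.0),
    RegionSnapshot("region_b", price_usd_per_mwh=50.0, carbon_g_per_kwh=100.0),
    RegionSnapshot("region_c", price_usd_per_mwh=40.0, carbon_g_per_kwh=300.0),
]

cheapest_first = rank_regions(candidates, carbon_weight=0.0)
cleanest_first = rank_regions(candidates, carbon_weight=1.0)
```

With these placeholder numbers the cheapest region is also the most carbon-intensive, so the ranking reverses as `carbon_weight` moves from 0 to 1. That reversal is the kind of trade-off the siting family asks an agent to reason about.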
