A Benchmark on LLM-Based Power Flow Computation: Do More Structured Prompts Help?

Tingwei Chen; Kaiyang Huang; Kai Sun

arXiv:2605.18642·eess.SY·May 19, 2026

TL;DR

This paper benchmarks three large language models on Gauss-Seidel AC power flow computation for a three-bus system. Prompt structure strongly affects accuracy, and none of the models is yet reliable enough for direct numerical solving.

Contribution

It provides a controlled comparison of three LLMs across four prompt formats on power flow problems, showing how prompt design affects model accuracy.

Findings

01 Gemini 2.5 Pro performs best with the simplest narrative prompt (MAE = 0.257 MW/MVar)

02 A JSON-structured prompt raises Gemini 2.5 Pro's MAE to 0.789, about 3.1 times higher

03 GPT-3.5 Turbo fails on at least 90% of cases under every prompt format
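The findings above rest on two error measures reported in the abstract: mean absolute error (MAE) against numerically computed reference solutions, and the share of cases within 5% relative error. The sketch below shows one plausible way to compute both. The function names and the per-value definition of "within 5%" are assumptions made for illustration; the paper's exact scoring rule is not given on this page.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over all paired values."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def fraction_within(predicted, reference, rel_tol=0.05):
    """Share of values whose error is at most rel_tol * |reference|.

    Assumption: each value is scored on its own. A reference of exactly
    zero only counts as a hit when the prediction is also exactly zero.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1 for p, r in zip(predicted, reference) if abs(p - r) <= rel_tol * abs(r)
    )
    return hits / len(reference)


# Ratio of the two Gemini 2.5 Pro MAE values quoted in the abstract.
mae_narrative = 0.257
mae_json = 0.789
mae_ratio = mae_json / mae_narrative  # rounds to 3.1 at one decimal place
```

Under this reading, the reported 3.1 times increase is simply the ratio of the two MAE values, 0.789 / 0.257.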

Abstract

We present a controlled benchmark evaluating three LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-3.5 Turbo) across four prompt formats (from concise narrative to structured JSON with explicit iteration trace) on Gauss-Seidel AC power flow computation for a three-bus system. Against 50 test cases with reference solutions computed numerically, Gemini 2.5 Pro with the simplest narrative prompt achieves the lowest mean absolute error (MAE = 0.257 MW/MVar, 54% of cases within 5% relative error), while the same model with a JSON-structured prompt raises MAE to 0.789, a 3.1× increase. Adding a worked example degrades accuracy for Gemini but provides a marginal gain for Claude. GPT-3.5 Turbo fails on at least 90% of cases under all prompt formats. An independent 100-case replication with related prompt-format families confirms the qualitative ordering (Gemini > Claude > …
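For readers unfamiliar with the task the models are asked to perform, here is a minimal sketch of a Gauss-Seidel AC power flow on a three-bus system, of the kind that could produce numerical reference solutions. The network data (line impedances, loads, a slack bus plus two PQ buses) and all function names are hypothetical and are not taken from the paper.

```python
def build_ybus(n_bus, lines):
    """Bus admittance matrix from (from_bus, to_bus, series_impedance) tuples."""
    Y = [[0j] * n_bus for _ in range(n_bus)]
    for i, k, z in lines:
        y = 1 / z
        Y[i][i] += y
        Y[k][k] += y
        Y[i][k] -= y
        Y[k][i] -= y
    return Y


def gauss_seidel(Y, s_inj, v_slack=1 + 0j, tol=1e-10, max_iter=500):
    """Solve for bus voltages; bus 0 is the slack bus, all others are PQ buses.

    s_inj[i] is the net complex power injected at bus i in per unit
    (negative for loads). The slack entry s_inj[0] is ignored.
    """
    n = len(Y)
    V = [v_slack] + [1 + 0j] * (n - 1)  # flat start
    for iteration in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(1, n):
            coupling = sum(Y[i][k] * V[k] for k in range(n) if k != i)
            # V_i = (conj(S_i) / conj(V_i) - sum_{k != i} Y_ik V_k) / Y_ii
            v_new = (s_inj[i].conjugate() / V[i].conjugate() - coupling) / Y[i][i]
            max_change = max(max_change, abs(v_new - V[i]))
            V[i] = v_new  # newest value is used immediately for later buses
        if max_change < tol:
            return V, iteration
    raise RuntimeError("Gauss-Seidel did not converge")


def bus_injections(Y, V):
    """Complex power injected at every bus: S_i = V_i * conj(sum_k Y_ik V_k)."""
    n = len(Y)
    return [V[i] * sum(Y[i][k] * V[k] for k in range(n)).conjugate() for i in range(n)]


# Hypothetical three-bus data in per unit: three identical lines, two load buses.
lines = [(0, 1, 0.02 + 0.08j), (0, 2, 0.02 + 0.08j), (1, 2, 0.02 + 0.08j)]
s_spec = [0j, -(0.5 + 0.2j), -(0.3 + 0.1j)]

Y = build_ybus(3, lines)
V, n_iter = gauss_seidel(Y, s_spec)
S = bus_injections(Y, V)  # S[0] is the slack-bus generation covering loads and losses
```

Each sweep updates one bus voltage at a time and reuses the newest values immediately, which is what distinguishes Gauss-Seidel from a Jacobi-style update. An LLM asked to solve the problem directly has to carry this iteration out in text, without the floating-point bookkeeping the code performs.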
