A Benchmark on LLM-Based Power Flow Computation: Do More Structured Prompts Help?
Tingwei Chen, Kaiyang Huang, Kai Sun

TL;DR
This paper benchmarks three large language models on power flow computation tasks, revealing that prompt structure significantly impacts accuracy and that none are yet reliable enough for direct numerical solving.
Contribution
It provides a controlled comparison of three LLMs across four prompt formats on Gauss-Seidel power flow problems for a three-bus system, highlighting the effect of prompt design on numerical accuracy.
Findings
Gemini 2.5 Pro performs best with simple prompts
Structured JSON prompts increase error significantly (for Gemini 2.5 Pro, MAE rises from 0.257 to 0.789)
GPT-3.5 Turbo fails on most cases across formats
Abstract
We present a controlled benchmark evaluating three LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-3.5 Turbo) across four prompt formats (from concise narrative to structured JSON with explicit iteration trace) on Gauss-Seidel AC power flow computation for a three-bus system. Against 50 test cases with reference solutions computed numerically, Gemini 2.5 Pro with the simplest narrative prompt achieves the lowest mean absolute error (MAE = 0.257 MW/MVar, 54% of cases within 5% relative error), while the same model with a JSON-structured prompt raises MAE to 0.789, a roughly 3.1× increase. Adding a worked example degrades accuracy for Gemini but provides a marginal gain for Claude. GPT-3.5 Turbo fails on at least 90% of cases under all prompt formats. An independent 100-case replication with related prompt-format families confirms the qualitative ordering (Gemini Claude …
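For readers unfamiliar with the task, the following is a minimal Python sketch of the kind of computation the models are asked to reproduce, together with the two reported metrics. It is not the authors' code. The line impedances, the loads, the choice of bus 0 as slack with the other two buses as PQ buses, and the helper names `build_ybus`, `gauss_seidel_power_flow`, and `mae_and_hit_rate` are all illustrative assumptions. The per-value definition of "within 5% relative error" is likewise an assumption, since the abstract does not spell it out.

```python
import numpy as np

def build_ybus(n_bus, lines):
    """Assemble the bus admittance matrix from (from_bus, to_bus, impedance) tuples."""
    ybus = np.zeros((n_bus, n_bus), dtype=complex)
    for a, b, z in lines:
        y = 1.0 / z
        ybus[a, a] += y
        ybus[b, b] += y
        ybus[a, b] -= y
        ybus[b, a] -= y
    return ybus

def gauss_seidel_power_flow(ybus, s_inj, v_slack=1.0 + 0.0j, tol=1e-10, max_iter=1000):
    """Gauss-Seidel AC power flow with bus 0 as slack and every other bus as PQ.

    s_inj[i] = P_i + jQ_i is the net injection at bus i in per unit
    (generation minus load); the slack entry s_inj[0] is ignored.
    """
    n_bus = ybus.shape[0]
    v = np.ones(n_bus, dtype=complex)  # flat start
    v[0] = v_slack
    for iteration in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(1, n_bus):
            # Current drawn through the other buses, using the latest voltages.
            coupling = ybus[i] @ v - ybus[i, i] * v[i]
            v_new = (np.conj(s_inj[i]) / np.conj(v[i]) - coupling) / ybus[i, i]
            max_change = max(max_change, abs(v_new - v[i]))
            v[i] = v_new  # updated in place, which is what makes this Gauss-Seidel
        if max_change < tol:
            return v, iteration
    raise RuntimeError("Gauss-Seidel did not converge")

def mae_and_hit_rate(predicted, reference, rel_tol=0.05):
    """Mean absolute error and the fraction of values within rel_tol relative error."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    abs_err = np.abs(predicted - reference)
    within = abs_err <= rel_tol * np.abs(reference)
    return abs_err.mean(), within.mean()

# Hypothetical three-bus system (per-unit values chosen for illustration only).
lines = [(0, 1, 0.02 + 0.06j), (0, 2, 0.08 + 0.24j), (1, 2, 0.06 + 0.18j)]
ybus = build_ybus(3, lines)
s_inj = np.array([0.0, -(0.5 + 0.2j), -(0.6 + 0.25j)])  # loads at buses 1 and 2

v, iters = gauss_seidel_power_flow(ybus, s_inj)
s_calc = v * np.conj(ybus @ v)  # complex power injections implied by the solution
print("iterations:", iters)
print("|V|:", np.abs(v))
print("angle (deg):", np.degrees(np.angle(v)))
```

A reference solution like `v` plays the role of the numerically computed ground truth. An LLM's reported values would then be scored with `mae_and_hit_rate` against the corresponding reference quantities.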
