Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

TL;DR
This paper investigates when prompt optimization helps in compound AI systems, finds that its outcome is often no better than a coin flip, and provides diagnostic tools to predict when it will be beneficial.
Contribution
The paper introduces a diagnostic framework, consisting of an ANOVA pre-test and a headroom test, to determine when prompt optimization is likely to improve the performance of a compound AI system.
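To make the idea of an ANOVA check for prompt interaction concrete, here is a minimal sketch of the interaction F-statistic in a balanced two-way design. It is not the authors' implementation: the function name `interaction_f_statistic`, the two-agent grid layout, and the toy scores are all assumptions chosen for illustration. Rows index prompt variants for one agent, columns index variants for a second agent, and each cell holds repeated end-to-end scores.

```python
from statistics import mean


def interaction_f_statistic(grid):
    """F-statistic for the A x B interaction in a balanced two-way ANOVA.

    grid[i][j] is a list of replicate scores obtained with prompt variant i
    for agent A and prompt variant j for agent B (hypothetical layout).
    """
    a, b, r = len(grid), len(grid[0]), len(grid[0][0])
    cell = [[mean(grid[i][j]) for j in range(b)] for i in range(a)]
    row = [mean(cell[i]) for i in range(a)]                       # agent A main effect
    col = [mean(cell[i][j] for i in range(a)) for j in range(b)]  # agent B main effect
    grand = mean(row)

    # Interaction: what is left of each cell mean after removing both main effects.
    ss_ab = r * sum(
        (cell[i][j] - row[i] - col[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    # Error: replicate-to-replicate noise inside each cell.
    ss_err = sum(
        (x - cell[i][j]) ** 2
        for i in range(a) for j in range(b) for x in grid[i][j]
    )
    df_ab, df_err = (a - 1) * (b - 1), a * b * (r - 1)
    return (ss_ab / df_ab) / (ss_err / df_err)


# Toy scores (invented): effects of the two prompts simply add up.
additive = [[[0.49, 0.51], [0.59, 0.61]],
            [[0.69, 0.71], [0.79, 0.81]]]
# Toy scores (invented): the best prompt for A depends on the prompt for B.
interacting = [[[0.49, 0.51], [0.59, 0.61]],
               [[0.69, 0.71], [0.49, 0.51]]]

print(interaction_f_statistic(additive))
print(interaction_f_statistic(interacting))
```

An F-statistic near zero, as in the additive grid, suggests the prompts can be tuned independently; a large value suggests joint optimization might matter. Turning the statistic into a p-value requires the F distribution with `(a - 1) * (b - 1)` and `a * b * (r - 1)` degrees of freedom, which a statistics library can supply.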
Findings
Optimization helps only with tasks having exploitable output structure
Interaction effects between prompts are statistically insignificant
The diagnostic tools can predict optimization success with high accuracy
Abstract
Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku (6 methods × 4 tasks × 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy: (A) individual prompts are worth optimizing, and (B) agent prompts interact, requiring joint optimization. Interaction effects are never significant, and optimization helps only when the task has exploitable output structure: a format the model can produce but does not default to. We provide a two-stage diagnostic: an…
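The "coin flip" claim can be sanity-checked with an exact two-sided binomial test against a fair coin. The sketch below is an illustration, not the authors' analysis: the helper name `coin_flip_p_value` is hypothetical, and the failure count is an assumption obtained by rounding 49% of 72 runs to 35.

```python
from math import comb


def coin_flip_p_value(failures: int, runs: int) -> float:
    """Exact two-sided binomial p-value for H0: P(failure) = 0.5."""
    observed = comb(runs, failures)
    # Under a fair coin every outcome k has probability comb(runs, k) / 2**runs,
    # so sum the outcomes that are no more likely than the observed one.
    tail = sum(
        comb(runs, k) for k in range(runs + 1) if comb(runs, k) <= observed
    )
    return tail / 2 ** runs


runs = 72
failures = round(0.49 * runs)  # assumption: 49% of 72 runs is about 35
p_value = coin_flip_p_value(failures, runs)
print(f"{failures}/{runs} runs below zero-shot, p = {p_value:.3f}")
```

Under these assumptions the p-value is far above any conventional significance threshold, which is what "statistically indistinguishable from a coin flip" means here: the observed failure rate gives no evidence that optimization either helps or hurts on average.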
