Predicting Performance of Symbolic and Prompt Programs with Examples
Chengqi Zheng, Keya Hu, Shuzhi Liu, Tao Wu, Kevin Ellis, Yewen Pu

TL;DR
This paper introduces a probabilistic model and a retrieval-based method called RAP for predicting the performance of symbolic and prompt programs on unseen tasks, addressing reliability issues in LLM prompting.
Contribution
It develops a simple Bernoulli-based performance prediction model and proposes RAP, a retrieval method to construct priors for better performance estimation.
Findings
Performance for symbolic programs is all or nothing, while prompt programs have a diffuse prior.
A few passing tests can certify symbolic programs but not prompt programs.
RAP achieves solid performance in predicting program success.
Abstract
LLM prompting is widely used for naturally stated tasks, yet it is unreliable: it may succeed on a few test cases but fail at deployment time. We study performance prediction: given a program, either symbolic (e.g., Python) or a prompt executed on an LLM, and a few in-domain examples, predict its performance on unseen tasks from the same domain. We use a simple coin-flip model, treating each pass/fail program execution as a Bernoulli random variable whose success probability is the program's unknown performance. In this model, the predicted performance depends entirely on: 1) the observed execution outcomes on test cases, and 2) a prior over performances. We compile empirical performance priors from a corpus of diverse programs and tasks, and find that performance for symbolic programs (e.g., Python) is all-or-nothing, while prompt programs have a diffuse prior with many nearly-correct programs. This…
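To make the coin-flip model concrete, here is a minimal sketch of the Bayesian update it implies: a discrete prior over performance values is reweighted by the Bernoulli likelihood of the observed pass/fail outcomes. The function names and both priors are hypothetical illustrations, not the empirical priors compiled in the paper, and the sketch does not implement RAP.

```python
def posterior(prior, passes, trials):
    """Update a discrete prior over performance p after observing test outcomes.

    prior: dict mapping a performance value p in [0, 1] to its prior probability.
    passes / trials: number of passing test cases out of those executed.
    """
    fails = trials - passes
    # Bernoulli likelihood of the observed outcomes under each candidate p.
    unnorm = {p: w * (p ** passes) * ((1 - p) ** fails) for p, w in prior.items()}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("observations are impossible under this prior")
    return {p: u / total for p, u in unnorm.items()}


def expected_performance(post):
    """Posterior mean: predicted pass rate on an unseen task."""
    return sum(p * w for p, w in post.items())


# Hypothetical priors, chosen only to illustrate the two regimes.
all_or_nothing = {0.0: 0.5, 1.0: 0.5}           # symbolic-style: fully wrong or fully right
diffuse = {i / 10: 1 / 11 for i in range(11)}   # prompt-style: many partially correct programs

symbolic_estimate = expected_performance(posterior(all_or_nothing, passes=3, trials=3))
prompt_estimate = expected_performance(posterior(diffuse, passes=3, trials=3))
```

Under these assumed priors, three passing tests push the all-or-nothing estimate to 1.0, while the diffuse estimate stays at roughly 0.84, which mirrors the finding that a few passing tests can certify a symbolic program but not a prompt program.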
