Importance Sampling is All You Need: Predict LLM's performance on new benchmark by reusing existing benchmark
Junjie Shi, Wei Ma, Shi Ying, Lingxiao Jiang, Yang Liu, Bo Du

TL;DR
This paper introduces BIS, a prompt-centric importance sampling framework that predicts large language model performance on new code generation benchmarks without executing code, reducing costs and contamination risks.
Contribution
BIS leverages importance sampling and autoencoders to estimate LLM performance from existing benchmarks, enabling reliable, ground-truth-free evaluation on unseen tasks.
Findings
Achieves 1.1% average error in code correctness prediction
Generalizes well to other metrics such as pass@1, with 2.15% error
Reduces benchmarking costs and contamination risks significantly
Abstract
With the rapid advancement of large language models (LLMs), code generation has become a key benchmark for evaluating LLM capabilities. However, existing benchmarks face two major challenges: (1) the escalating cost of constructing high-quality test suites and reference solutions, and (2) the increasing risk of data contamination, which undermines the reliability of benchmark-based evaluations. In this paper, we propose BIS, a prompt-centric evaluation framework that enables ground-truth-free prediction of LLM performance on code generation tasks. Rather than executing generated code, BIS estimates performance metrics by analyzing the prompt distribution alone. Built on importance sampling theory and implemented using Importance Weighted Autoencoders, our method reweights samples from existing annotated benchmarks to estimate performance on new, unseen benchmarks. To stabilize the estimation,…
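The reweighting idea in the abstract can be illustrated with a minimal self-normalized importance sampling sketch. This is not the BIS implementation: the paper estimates prompt densities with Importance Weighted Autoencoders, whereas this toy example assumes the densities are known one-dimensional Gaussians. The names `gaussian_pdf`, `snis_estimate`, and `demo`, the choice of distributions, and the per-prompt score function are all hypothetical and chosen only for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; a stand-in for a learned prompt density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def snis_estimate(samples, scores, target_pdf, source_pdf):
    """Self-normalized importance sampling estimate of the mean score
    under the target distribution, using samples drawn from the source."""
    weights = [target_pdf(x) / source_pdf(x) for x in samples]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


def demo(seed=0, n=20000):
    """Toy setup (assumed, not from the paper): prompts are scalars,
    the existing benchmark follows N(0, 1), the new one follows N(0.5, 1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical per-prompt correctness observed on the existing benchmark.
    scores = [1.0 if x > 0 else 0.0 for x in xs]
    return snis_estimate(
        xs,
        scores,
        target_pdf=lambda x: gaussian_pdf(x, 0.5, 1.0),
        source_pdf=lambda x: gaussian_pdf(x, 0.0, 1.0),
    )


if __name__ == "__main__":
    print(f"estimated mean score on the new benchmark: {demo():.4f}")
```

In this toy setup the exact answer is the probability that a N(0.5, 1) variable is positive, about 0.6915, so the estimate can be checked against it. Normalizing by the sum of the weights, rather than by the sample count, trades a small bias for lower variance when the density ratio is only known approximately.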
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Natural Language Processing Techniques
