Importance Sampling is All You Need: Predict LLM's performance on new benchmark by reusing existing benchmark

Junjie Shi; Wei Ma; Shi Ying; Lingxiao Jiang; Yang Liu; Bo Du

arXiv:2508.01203·cs.AI·August 5, 2025

TL;DR

This paper introduces BIS, a prompt-centric importance sampling framework that predicts large language model performance on new code generation benchmarks without executing code, reducing costs and contamination risks.

Contribution

BIS leverages importance sampling and autoencoders to estimate LLM performance from existing benchmarks, enabling reliable, ground-truth-free evaluation on unseen tasks.

Findings

1. Achieves 1.1% average error in code correctness prediction

2. Generalizes well to other metrics like pass@1 with 2.15% error

3. Reduces benchmarking costs and contamination risks significantly

Abstract

With the rapid advancement of large language models (LLMs), code generation has become a key benchmark for evaluating LLM capabilities. However, existing benchmarks face two major challenges: (1) the escalating cost of constructing high-quality test suites and reference solutions, and (2) the increasing risk of data contamination, which undermines the reliability of benchmark-based evaluations. In this paper, we propose BIS, a prompt-centric evaluation framework that enables ground-truth-free prediction of LLM performance on code generation tasks. Rather than executing generated code, BIS estimates performance metrics by analyzing the prompt distribution alone. Built on importance sampling theory and implemented using Importance Weighted Autoencoders, our method reweights samples from existing annotated benchmarks to estimate performance on new, unseen benchmarks. To stabilize the estimation,…
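The reweighting idea in the abstract can be illustrated with a generic self-normalized importance sampling estimator. This is a minimal sketch and not the paper's implementation: the function name `self_normalized_is_estimate` and its inputs are hypothetical, and the log-densities of each prompt under the existing benchmark (`log_p`) and the new benchmark (`log_q`) are assumed to be supplied by some density model. The abstract says BIS uses Importance Weighted Autoencoders for this; here they are plain numbers.

```python
import math


def self_normalized_is_estimate(scores, log_q, log_p):
    """Estimate the mean score under a target prompt distribution q
    using prompts sampled from a source distribution p.

    scores: per-prompt results on the existing benchmark (e.g. 1.0 = correct).
    log_q:  assumed log-density of each prompt under the new benchmark.
    log_p:  assumed log-density of each prompt under the existing benchmark.
    """
    if not scores or not (len(scores) == len(log_q) == len(log_p)):
        raise ValueError("inputs must be non-empty and of equal length")

    # Importance weight per prompt is q(x) / p(x), computed in log space.
    log_w = [lq - lp for lq, lp in zip(log_q, log_p)]

    # Subtract the max before exponentiating for numerical stability;
    # the constant cancels in the normalized ratio below.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]

    # Self-normalized estimator: weighted average of the observed scores.
    return sum(wi * s for wi, s in zip(w, scores)) / sum(w)


# Hypothetical toy data: two existing prompts, one solved and one not.
# The new benchmark is assumed to favor prompts like the first one.
estimate = self_normalized_is_estimate(
    scores=[1.0, 0.0],
    log_q=[math.log(0.75), math.log(0.25)],
    log_p=[math.log(0.5), math.log(0.5)],
)
```

When the two distributions assign identical densities, every weight is equal and the estimate reduces to the plain average of the existing scores. The more the distributions differ, the more the estimate depends on a few heavily weighted prompts, which is one reason an estimator of this kind may need stabilizing.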

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Natural Language Processing Techniques