Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning
Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

TL;DR
This paper introduces an evaluation framework for synthetic verification methods in code and reasoning tasks, comprising four benchmarks and multiple metrics, and shows that reasoning and scaling the number of test cases both improve verification accuracy.
Contribution
It transforms existing benchmarks into scoring datasets, proposes new metrics, and releases four new benchmarks to evaluate synthetic verifiers for LLMs.
Findings
Reasoning improves test case generation.
Scaling test cases enhances verification accuracy.
Synthetic verifiers' effectiveness varies across methods.
Abstract
Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving the reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach that transforms existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of synthetic verifiers on the proposed benchmarks. Using this approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly…
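To make the scoring-and-ranking idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a synthetic verifier that scores each candidate solution by the fraction of generated test cases it passes, and it compares the resulting ranking against ground-truth scores using a rank correlation. The toy task (absolute value), the candidate solutions, both test lists, and the helper names score_solution, average_ranks, and spearman are all hypothetical; the choice of Spearman correlation is an illustrative assumption, not a claim about which metrics the paper uses.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[int, int]  # (input, expected output)


def score_solution(solution: Callable[[int], int], tests: List[TestCase]) -> float:
    """Return the fraction of test cases the candidate solution passes."""
    passed = 0
    for arg, expected in tests:
        try:
            if solution(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation; returns 0.0 if either side is constant."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


# Hypothetical candidates for "return the absolute value of x".
candidates = [
    lambda x: abs(x),  # correct
    lambda x: x,       # wrong for negative inputs
    lambda x: 0,       # correct only for x == 0
]

# Hypothetical ground-truth tests (stand-in for a benchmark's own tests).
reference_tests = [(-3, 3), (-1, 1), (0, 0), (2, 2), (5, 5)]
# Hypothetical tests produced by a synthetic verifier (e.g. an LLM).
synthetic_tests = [(-2, 2), (0, 0), (4, 4), (7, 7)]

reference_scores = [score_solution(c, reference_tests) for c in candidates]
synthetic_scores = [score_solution(c, synthetic_tests) for c in candidates]
rho = spearman(reference_scores, synthetic_scores)

print(reference_scores, synthetic_scores, rho)
```

In this toy setup the synthetic tests assign different absolute scores than the reference tests but preserve the ordering of the candidates, which is the property a ranking-oriented metric is meant to capture.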
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Software Engineering Research · Semantic Web and Ontologies
