Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

Aleksander Ficek; Somshubra Majumdar; Vahid Noroozi; Boris Ginsburg

arXiv:2502.13820·cs.AI·July 31, 2025

TL;DR

This paper introduces a new evaluation framework for synthetic verification methods in code and reasoning tasks, including four benchmarks and multiple metrics, demonstrating that reasoning and scaling improve verification accuracy.

Contribution

The paper shows how to transform existing coding benchmarks into scoring and ranking datasets, proposes metrics for evaluating synthetic verifiers for LLMs, and releases four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+).

Findings

01 Reasoning improves test case generation.

02 Scaling test cases enhances verification accuracy.

03 Synthetic verifiers' effectiveness varies across methods.

Abstract

Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving the reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach that transforms existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of synthetic verifiers on the proposed benchmarks. Using this approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly…
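To make the idea of a scoring and ranking benchmark concrete, here is a minimal sketch of how a synthetic verifier might be evaluated. It is not the paper's implementation: the toy task, the candidate solutions, the ground-truth scores, and the two metrics shown (top-1 agreement and Spearman rank correlation) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def verifier_score(solution: Callable[[int], int],
                   tests: Sequence[Tuple[int, int]]) -> float:
    """Fraction of synthetic (input, expected) test cases the solution passes."""
    passed = 0
    for arg, expected in tests:
        try:
            if solution(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crashing solution simply fails that test
    return passed / len(tests)


def top1_match(pred: Sequence[float], gold: Sequence[float]) -> bool:
    """True if the verifier's top-ranked solution is also best by ground truth."""
    best = max(range(len(pred)), key=lambda i: pred[i])
    return gold[best] == max(gold)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        return 0.0  # correlation is undefined when one side is constant
    return cov / (va * vb) ** 0.5


# Hypothetical toy task: "return the square of x".
candidates = [
    lambda x: x * x,                    # correct
    lambda x: x * x if x >= 0 else 0,   # wrong for negative inputs
    lambda x: x + x,                    # mostly wrong
]
# Assumed ground-truth quality scores for the candidates above.
gold_scores = [1.0, 0.5, 0.0]
# Assumed synthetic tests, standing in for LLM-generated test cases.
synthetic_tests = [(0, 0), (2, 4), (-3, 9), (5, 25)]

pred_scores = [verifier_score(c, synthetic_tests) for c in candidates]
print("verifier scores:", pred_scores)
print("top-1 agreement:", top1_match(pred_scores, gold_scores))
print("Spearman correlation:", spearman(pred_scores, gold_scores))
```

In this sketch the verifier is judged on how well its scores order the candidates, not on whether any single candidate passes, which is the distinction between a ranking dataset and an ordinary pass/fail benchmark.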

Code & Models

Datasets

nvidia/Scoring-Verifiers

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Software Engineering Research · Semantic Web and Ontologies