CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Shudong Liu; Hongwei Liu; Junnan Liu; Linchen Xiao; Songyang Gao; Chengqi Lyu; Yuzhe Gu; Wenwei Zhang; Derek F. Wong; Songyang Zhang; Kai Chen

arXiv:2508.03686·cs.CL·August 6, 2025

TL;DR

CompassVerifier is a lightweight, robust verifier model that evaluates LLM outputs across multiple domains and answer types, addressing limitations of current verification methods. The paper also introduces VerifierBench, a benchmark for measuring verification capabilities.

Contribution

The paper introduces CompassVerifier, a versatile verifier model with enhanced robustness and generalizability, along with VerifierBench, a comprehensive benchmark for LLM answer verification.

Findings

01. Demonstrates multi-domain competency including math and reasoning tasks.

02. Effectively identifies abnormal or invalid responses.

03. Supports various answer formats like formulas and multi-subproblems.

Abstract

Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also for serving as the reward model that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It…
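To make the limitation of rule-based matching concrete, here is a minimal sketch of a regex-style answer checker of the kind the abstract describes. It is not taken from the paper or from any evaluation framework: the pattern, `extract_answer`, and `rule_based_verify` are hypothetical names chosen for illustration, and the rule (take the text after the last "answer is" or "answer:" and compare strings) is an assumed, deliberately simple one.

```python
import re
from typing import Optional

# Hypothetical extraction rule: capture whatever follows "answer is" or "answer:".
ANSWER_PATTERN = re.compile(r"answer\s*(?:is|:)\s*(.+)", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'answer is' / 'answer:' marker, if any."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None
    # Drop surrounding whitespace and a trailing period.
    return matches[-1].strip().rstrip(".")


def rule_based_verify(response: str, gold: str) -> bool:
    """Exact string comparison between the extracted answer and the gold answer."""
    predicted = extract_answer(response)
    return predicted is not None and predicted == gold.strip()


# The rule works when the response follows the expected template.
print(rule_based_verify("The answer is 0.5.", "0.5"))   # True
# An equivalent answer in another form is rejected.
print(rule_based_verify("The answer is 1/2.", "0.5"))   # False
# A correct answer without the marker phrase is rejected as well.
print(rule_based_verify("So we get x = 0.5", "0.5"))    # False
```

Each new answer format or phrasing would need another hand-written rule in a checker like this, which is the repetitive customization the abstract refers to.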

Code & Models

Datasets

opencompass/VerifierBench
dataset · 86 downloads