BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo

TL;DR
This paper introduces BERT-as-a-Judge, a lightweight, encoder-based evaluation method for LLM outputs that outperforms lexical methods and matches larger LLM judges in accuracy, offering a scalable and reliable alternative.
Contribution
The paper presents BERT-as-a-Judge, an efficient encoder-based evaluation approach that assesses semantic correctness more reliably than lexical methods and rivals larger LLM judges.
Findings
Lexical evaluation correlates poorly with human judgments.
BERT-as-a-Judge outperforms lexical baselines in accuracy.
It matches larger LLM judges while being more computationally efficient.
Abstract
Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce…
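To make the formatting problem described in the abstract concrete, here is a minimal sketch of a rigid lexical evaluator. It is an illustration only, not the paper's evaluation code: the `Answer:` pattern, the `lexical_match` helper, and the sample responses are all hypothetical.

```python
import re

# Hypothetical extraction rule: the response must end with "Answer: <value>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(.+?)\s*$", re.IGNORECASE)


def lexical_match(response: str, reference: str) -> bool:
    """Return True only if the extracted answer string equals the reference."""
    match = ANSWER_PATTERN.search(response.strip())
    if match is None:
        # The response did not follow the expected format, so it is scored
        # as wrong regardless of whether its content is correct.
        return False
    return match.group(1).strip().lower() == reference.strip().lower()


# Three responses that all convey the reference answer "42".
responses = [
    "Answer: 42",          # follows the expected format
    "The result is 42.",   # correct, but no "Answer:" prefix
    "Answer: forty-two",   # correct, but a different surface form
]
results = [lexical_match(r, "42") for r in responses]
print(results)
```

Only the first response is scored as correct, although all three are semantically right. This is the kind of conflation between problem-solving ability and format compliance that a semantic judge is meant to avoid.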
Code & Models
- 🤗 artefactory/BERTJudge (model) · 172 downloads · ♡ 7
- 🤗 artefactory/BERTJudge-Formatted-QCR-500k (model) · 11 downloads · ♡ 1
- 🤗 artefactory/BERTJudge-Formatted-QCR-OOD (model) · 8 downloads
- 🤗 artefactory/BERTJudge-Formatted-CR (model) · 9 downloads
- 🤗 artefactory/BERTJudge-Formatted-QCR (model) · 12 downloads · ♡ 1
- 🤗 artefactory/BERTJudge-Formatted-QCR-100k (model) · 12 downloads
- 🤗 artefactory/BERTJudge-Formatted-QCR-200k (model) · 9 downloads
- 🤗 artefactory/BERTJudge-Free-CR (model) · 8 downloads · ♡ 1
