BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

Hippolyte Gisserot-Boukhlef; Nicolas Boizard; Emmanuel Malherbe; Céline Hudelot; Pierre Colombo

arXiv:2604.09497·cs.CL·April 13, 2026

1 Repo · 8 Models · 1 Dataset

TL;DR

This paper introduces BERT-as-a-Judge, a lightweight, encoder-based evaluation method for LLM outputs that outperforms lexical methods and matches larger LLM judges in accuracy, offering a scalable and reliable alternative.

Contribution

The paper presents BERT-as-a-Judge, an efficient encoder-based evaluation approach that assesses semantic correctness more reliably than lexical methods and rivals larger LLM judges in accuracy.

Findings

01 Lexical evaluation correlates poorly with human judgments.

02 BERT-as-a-Judge outperforms lexical baselines in accuracy.

03 It matches larger LLM judges while being more computationally efficient.

Abstract

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce…
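To make the abstract's point about lexical methods concrete, here is a minimal Python sketch of a rigid extract-and-compare scorer. It is an illustration only, not the paper's evaluation code: the `Answer: <value>` format, the `ANSWER_PATTERN` regex, and the `lexical_score` helper are all hypothetical choices made for this example.

```python
import re

# Hypothetical formatting guideline: the model must end with "Answer: <value>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(.+)")


def lexical_score(output: str, reference: str) -> bool:
    """Extract an answer with a fixed regex and compare it by exact match."""
    match = ANSWER_PATTERN.search(output)
    if match is None:
        # No parsable answer: the output is counted as wrong,
        # regardless of whether it actually contains the right value.
        return False
    return match.group(1).strip() == reference


reference = "42"
compliant = "Let me compute 6 * 7. Answer: 42"
noncompliant = "6 * 7 equals 42, so the result is 42."
verbose = "Answer: 42 (forty-two)"

for text in (compliant, noncompliant, verbose):
    print(lexical_score(text, reference), "|", text)
```

All three outputs state the correct value, yet only the first follows the assumed format and is scored as correct. This is the kind of conflation between problem-solving ability and formatting compliance that the abstract describes, and that semantic judges are meant to avoid.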

Code & Models

Repositories

artefactory/BERT-as-a-Judge
github

Models

Datasets

artefactory/BERTJudge-Dataset
dataset · 24k downloads
