Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation

Dongryeol Lee; Yerin Hwang; Yongil Kim; Joonsuk Park; Kyomin Jung

arXiv:2410.20774·cs.CL·May 2, 2025

TL;DR

This paper introduces EMBER, a benchmark for evaluating how robust LLM-based judges are when assessing outputs that contain epistemic markers. It finds that judges are biased against such markers, most strongly against expressions of uncertainty.

Contribution

The study presents EMBER, the first benchmark for testing LLM-judges' robustness to epistemic markers, and demonstrates their vulnerability to bias caused by such markers.

Findings

01 LLM-judges show bias against epistemic markers.

02 All tested LLM-judges are affected by epistemic markers.

03 Bias is stronger against markers expressing uncertainty.

Abstract

In line with the principle of honesty, there has been a growing effort to train large language models (LLMs) to generate outputs containing epistemic markers. However, evaluation in the presence of epistemic markers has been largely overlooked, raising a critical question: Could the use of epistemic markers in LLM-generated outputs lead to unintended negative consequences? To address this, we present EMBER, a benchmark designed to assess the robustness of LLM-judges to epistemic markers in both single and pairwise evaluation settings. Our findings, based on evaluations using EMBER, reveal that all tested LLM-judges, including GPT-4o, show a notable lack of robustness in the presence of epistemic markers. Specifically, we observe a negative bias toward epistemic markers, with a stronger bias against markers expressing uncertainty. This suggests that LLM-judges are influenced by the…
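To make the idea of robustness to epistemic markers concrete, here is a minimal sketch of one way to probe a judge in the single-evaluation setting: attach a marker to an otherwise unchanged answer and count how often the verdict flips. This is an illustration under stated assumptions, not EMBER's actual protocol or metric. The marker lists, the `toy_judge` stand-in, and the `flip_rate` helper are all hypothetical and appear nowhere in the paper or its repository.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical example markers, chosen for illustration (not taken from EMBER).
WEAKENERS = ["I'm not sure, but", "It might be that"]
STRENGTHENERS = ["I'm certain that", "Undoubtedly,"]

# A judge maps (question, answer) to a verdict: True means "answer accepted".
Judge = Callable[[str, str], bool]


def add_marker(marker: str, answer: str) -> str:
    """Prefix an answer with an epistemic marker, leaving its content unchanged."""
    return f"{marker} {answer}"


def flip_rate(judge: Judge,
              items: Iterable[Tuple[str, str]],
              markers: Sequence[str]) -> float:
    """Fraction of (item, marker) pairs whose verdict changes once the marker is added.

    A judge that ignores epistemic markers entirely would score 0.0.
    """
    flips = 0
    total = 0
    for question, answer in items:
        baseline = judge(question, answer)
        for marker in markers:
            total += 1
            if judge(question, add_marker(marker, answer)) != baseline:
                flips += 1
    return flips / total if total else 0.0


def toy_judge(question: str, answer: str) -> bool:
    """Stand-in for an LLM-judge (assumption): rejects any answer that sounds hedged."""
    hedges = ("not sure", "might")
    return not any(h in answer.lower() for h in hedges)


# Hypothetical question/answer pairs; both answers are correct.
ITEMS = [
    ("What is the capital of France?", "Paris is the capital of France."),
    ("What is 2 + 2?", "2 + 2 equals 4."),
]

if __name__ == "__main__":
    print("weakeners:", flip_rate(toy_judge, ITEMS, WEAKENERS))
    print("strengtheners:", flip_rate(toy_judge, ITEMS, STRENGTHENERS))
```

In practice `toy_judge` would be replaced by a call to a real LLM-judge. The pairwise setting mentioned in the abstract would need a judge that compares two answers, which this sketch does not cover.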

Code & Models

Repositories

dongryeollee96/ember
pytorch · Official

Datasets

Dongryeol/EMBER
dataset · 5 dl

Videos

Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation · underline

Taxonomy

Topics: Law, Economics, and Judicial Systems · Artificial Intelligence in Law