Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
Dongryeol Lee, Yerin Hwang, Yongil Kim, Joonsuk Park, Kyomin Jung

TL;DR
This paper introduces EMBER, a benchmark for evaluating how robust LLM-based judges are when assessing outputs that contain epistemic markers. It reveals a negative bias against such markers, strongest for expressions of uncertainty.
Contribution
The study presents EMBER, the first benchmark for testing LLM-judges' robustness to epistemic markers, and demonstrates their vulnerability to bias caused by such markers.
Findings
LLM-judges show a negative bias against outputs that contain epistemic markers.
All tested LLM-judges, including GPT-4o, show a notable lack of robustness to epistemic markers.
Bias is stronger against markers expressing uncertainty.
Abstract
In line with the principle of honesty, there has been a growing effort to train large language models (LLMs) to generate outputs containing epistemic markers. However, evaluation in the presence of epistemic markers has been largely overlooked, raising a critical question: Could the use of epistemic markers in LLM-generated outputs lead to unintended negative consequences? To address this, we present EMBER, a benchmark designed to assess the robustness of LLM-judges to epistemic markers in both single and pairwise evaluation settings. Our findings, based on evaluations using EMBER, reveal that all tested LLM-judges, including GPT-4o, show a notable lack of robustness in the presence of epistemic markers. Specifically, we observe a negative bias toward epistemic markers, with a stronger bias against markers expressing uncertainty. This suggests that LLM-judges are influenced by the…
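As a rough illustration of what "robustness to epistemic markers" could mean in the single-evaluation setting, the sketch below measures how often a judge's verdict changes when a marker is prepended to an otherwise identical answer. It is not EMBER's protocol or metric: the marker phrases, the `flip_rate` function, and the deliberately biased `toy_judge` (a stand-in for a real LLM call) are all assumptions made up for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical marker phrases, chosen for illustration only (not taken from EMBER).
UNCERTAIN = "I'm not sure, but "
CERTAIN = "I'm certain that "

# A judge maps (question, answer) to a correct/incorrect verdict.
Judge = Callable[[str, str], bool]


def add_marker(answer: str, marker: str) -> str:
    """Prepend a marker; naively lowercases the answer's first letter."""
    return marker + answer[0].lower() + answer[1:]


def flip_rate(judge: Judge, items: List[Tuple[str, str]], marker: str) -> float:
    """Fraction of items whose verdict changes once the marker is added."""
    flips = sum(
        judge(question, answer) != judge(question, add_marker(answer, marker))
        for question, answer in items
    )
    return flips / len(items)


def toy_judge(question: str, answer: str) -> bool:
    """Stand-in for an LLM judge, deliberately biased against hedged answers."""
    text = answer.lower()
    return "paris" in text and "not sure" not in text


items = [
    ("What is the capital of France?", "The capital is Paris."),
    ("What is the capital of France?", "The capital is Lyon."),
]

print(flip_rate(toy_judge, items, UNCERTAIN))
print(flip_rate(toy_judge, items, CERTAIN))
```

A robust judge would yield a flip rate near zero for both markers, since the marker leaves the factual content of the answer unchanged. Replacing `toy_judge` with a call to a real model would keep the rest of the sketch the same.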
Taxonomy
Topics: Law, Economics, and Judicial Systems · Artificial Intelligence in Law
