Measuring the Robustness of Reference-Free Dialogue Evaluation Systems

Justin Vasselli; Adam Nohejl; Taro Watanabe

arXiv:2501.06728·cs.CL·January 14, 2025

TL;DR

This paper introduces a benchmark for evaluating how robust reference-free dialogue evaluation metrics are to adversarial attacks. It shows that a metric's correlation with human judgment on traditional benchmarks does not reliably predict its vulnerability to adversarial inputs.

Contribution

The paper provides a systematic framework for assessing the robustness of reference-free dialogue metrics and highlights the need for evaluation methods that account for adversarial robustness, not only correlation with human judgment.

Findings

01. Metrics vary in robustness to adversarial attacks.

02. Traditional benchmarks may not reflect real-world vulnerabilities.

03. Some metrics show weak correlation with human judgment under attack.

Abstract

Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval -- a prompt-based method leveraging LLMs -- across grounded and ungrounded datasets. By examining both their correlation with human judgment and susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear to be equivalent when judged by traditional benchmarks may, in fact, vary in their scores of adversarial responses. These findings motivate the development of…
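
The four attack categories named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the "Speaker B:" tag, the canned reply, the word-shuffling used to produce ungrammatical text, and the `fooled_rate` measure are all assumptions chosen for illustration. `overlap_metric` is a toy stand-in for a real metric and does not represent DialogRPT, UniEval, or PromptEval.

```python
import random
from typing import Callable, Dict, List, Tuple

Context = List[str]                       # dialogue history, oldest turn first
Metric = Callable[[Context, str], float]  # hypothetical metric interface


def speaker_tag_prefix(context: Context, response: str, tag: str = "Speaker B:") -> str:
    """Prepend a speaker tag to an otherwise unchanged response (tag is an assumption)."""
    return f"{tag} {response}"


def static_response(context: Context, response: str, canned: str = "I don't know.") -> str:
    """Replace the response with a fixed, context-independent reply (assumed wording)."""
    return canned


def ungrammatical(context: Context, response: str, seed: int = 0) -> str:
    """Shuffle word order; one assumed way to make a response ungrammatical."""
    words = response.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def repeated_context(context: Context, response: str) -> str:
    """Echo the most recent context turn instead of answering it."""
    return context[-1] if context else response


ATTACKS: Dict[str, Callable[[Context, str], str]] = {
    "speaker_tag_prefix": speaker_tag_prefix,
    "static_response": static_response,
    "ungrammatical": ungrammatical,
    "repeated_context": repeated_context,
}


def build_adversarial(context: Context, response: str) -> Dict[str, str]:
    """Apply every attack to one (context, response) pair."""
    return {name: attack(context, response) for name, attack in ATTACKS.items()}


def fooled_rate(metric: Metric, examples: List[Tuple[Context, str]], attack_name: str) -> float:
    """Fraction of examples where the attacked response scores at least as high as the original."""
    attack = ATTACKS[attack_name]
    fooled = sum(metric(c, attack(c, r)) >= metric(c, r) for c, r in examples)
    return fooled / len(examples)


def overlap_metric(context: Context, response: str) -> float:
    """Toy metric (assumption): share of response tokens that also occur in the context."""
    ctx_tokens = set(" ".join(context).lower().split())
    tokens = response.lower().split()
    return sum(t in ctx_tokens for t in tokens) / len(tokens) if tokens else 0.0
```

A metric that rewards lexical overlap with the context, like the toy one here, will score an echoed context turn above a genuine reply, which is the kind of failure a robustness benchmark is meant to surface. Reporting a rate such as `fooled_rate` next to the usual correlation with human ratings keeps the two axes described in the abstract separate.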

Code & Models

Repositories

jvasselli/dialogue-metric-robustness (Official)
