TL;DR
This paper introduces a benchmark for evaluating how robust reference-free dialogue evaluation metrics are to adversarial attacks. It shows that metrics which look equivalent on traditional benchmarks can differ in how they score adversarial responses.
Contribution
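The paper provides a systematic framework for assessing the robustness of reference-free dialogue metrics against four categories of adversarial attack, and highlights the need for evaluation that looks beyond correlation with human judgment alone.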
Findings
Metrics differ in how robust they are to adversarial attacks.
Correlation with human judgment and susceptibility to adversarial attacks are not always aligned.
Metrics that appear equivalent on traditional benchmarks may score adversarial responses differently.
Abstract
Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval -- a prompt-based method leveraging LLMs -- across grounded and ungrounded datasets. By examining both their correlation with human judgment and susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear to be equivalent when judged by traditional benchmarks may, in fact, vary in their scores of adversarial responses. These findings motivate the development of…
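The four attack categories named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the default speaker tag, the canned reply, and the word-shuffling used to produce an ungrammatical response are hypothetical stand-ins, not the paper's actual constructions.

```python
import random


def speaker_tag_prefix(response: str, tag: str = "Speaker B:") -> str:
    """Prepend a speaker tag to an otherwise unchanged response (tag is an assumed default)."""
    return f"{tag} {response}"


def static_response(canned: str = "I don't know.") -> str:
    """Return the same canned reply regardless of context (reply text is an assumption)."""
    return canned


def ungrammatical_response(response: str, seed: int = 0) -> str:
    """Shuffle word order as a crude stand-in for an ungrammatical response."""
    words = response.split()
    random.Random(seed).shuffle(words)  # seeded so the result is reproducible
    return " ".join(words)


def repeated_context(context: list[str]) -> str:
    """Echo the most recent context turn back as the response."""
    return context[-1]


def build_adversarial_set(context: list[str], response: str) -> dict[str, str]:
    """Produce one adversarial candidate per attack category for a single dialogue."""
    return {
        "speaker_tag_prefix": speaker_tag_prefix(response),
        "static_response": static_response(),
        "ungrammatical_response": ungrammatical_response(response),
        "repeated_context": repeated_context(context),
    }


context = ["Have you seen any good films lately?", "Not really, any suggestions?"]
response = "You might enjoy a classic heist movie."
candidates = build_adversarial_set(context, response)
```

Each candidate would then be scored by the metric under test with the same context. A robust metric should not rate these candidates as highly as a genuine, relevant response.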
