Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation

John Mendonça; Patrícia Pereira; Helena Moniz; João Paulo Carvalho; Alon Lavie; Isabel Trancoso

arXiv:2308.16797·cs.CL·September 11, 2023·6 cites

TL;DR

This paper introduces a prompt-based framework leveraging Large Language Models to improve robustness and multilingual capabilities in automatic dialogue evaluation, achieving state-of-the-art results across multiple benchmarks.

Contribution

It presents a novel prompting-based approach that enhances dialogue evaluation metrics' robustness and multilinguality, outperforming existing methods.

Findings

1. Achieves state-of-the-art mean Spearman correlation scores across several benchmarks.
2. Ranks first on both the Robust and Multilingual tasks of DSTC11 Track 4.
3. Demonstrates the effectiveness of prompted LLMs for dialogue evaluation.

Abstract

Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring metrics are invariant to semantically similar responses is also an overlooked topic. In order to achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that takes advantage of the strengths of current evaluation models with the newly-established paradigm of prompting Large Language Models (LLMs). Empirical results show our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks and ranks first on both the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", proving the evaluation capabilities of…
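
The headline number in the abstract is a mean Spearman correlation across benchmarks: for each benchmark, the metric's scores are rank-correlated with human judgments, and the per-benchmark correlations are then averaged. The following is a minimal sketch of that computation, not the authors' evaluation code. The function names (`average_ranks`, `spearman`, `mean_spearman`), the benchmark names, and all scores are made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_spearman(benchmarks):
    """Average the per-benchmark Spearman correlations."""
    return mean(spearman(metric, human) for metric, human in benchmarks.values())


# Hypothetical toy data: (metric scores, human ratings) per benchmark.
benchmarks = {
    "toy_a": ([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 5]),  # identical ordering, rho = 1.0
    "toy_b": ([0.9, 0.2, 0.5], [1, 2, 3]),           # mostly reversed, rho = -0.5
}

print(mean_spearman(benchmarks))  # 0.25 for this toy data
```

Because Spearman correlation depends only on rank order, a metric is not penalized for using a different scale than the human annotators, only for ordering responses differently.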

Code & Models

Repositories

johndmendonca/dialevalml (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques