MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators
John Mendonça, Alon Lavie, Isabel Trancoso

TL;DR
MEDAL is an automated multi-agent framework that uses several LLMs to build multilingual open-domain dialogue benchmarks, then uses those benchmarks to test LLM judges, revealing that current evaluators have trouble detecting nuanced dialogue issues.
Contribution
The paper presents MEDAL, a novel automated multi-agent framework that creates diverse multilingual dialogue benchmarks and assesses LLM evaluators' effectiveness.
Findings
State-of-the-art judges struggle with nuanced issues like empathy and relevance.
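A GPT-4.1-based multidimensional analysis of the generated dialogues uncovers noticeable cross-lingual differences in chatbot performance.
The curated, human-annotated meta-evaluation benchmark exposes the limits of current LLM evaluators when quality judgments are nuanced.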
Abstract
Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This…
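The meta-evaluation idea in the abstract, comparing an LLM judge's verdicts against human annotations across languages, can be illustrated with a minimal sketch. This is not the authors' code: the record layout, the label names, the language codes, and the choice of Cohen's kappa as the agreement statistic are all assumptions made for illustration, and the example data is invented.

```python
from collections import Counter, defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa (chance-corrected agreement) for two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_language(records):
    """Group (language, human_label, judge_label) records and score each language.

    The record layout is hypothetical; it is not taken from the paper.
    """
    grouped = defaultdict(lambda: ([], []))
    for language, human, judge in records:
        grouped[language][0].append(human)
        grouped[language][1].append(judge)
    return {lang: cohen_kappa(h, j) for lang, (h, j) in grouped.items()}


# Invented toy data: the judge matches humans in "en" but only at chance in "pt".
toy_records = [
    ("en", "good", "good"), ("en", "bad", "bad"),
    ("en", "good", "good"), ("en", "bad", "bad"),
    ("pt", "good", "bad"), ("pt", "bad", "good"),
    ("pt", "good", "good"), ("pt", "bad", "bad"),
]
scores = agreement_by_language(toy_records)
```

Reporting agreement per language rather than pooled is one way a cross-lingual gap in judge reliability would become visible; a pooled score could hide it.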
Taxonomy
Topics: Natural Language Processing Techniques · Interpreting and Communication in Healthcare · Linguistics and Terminology Studies
