MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

John Mendonça; Alon Lavie; Isabel Trancoso

arXiv:2505.22777·cs.CL·January 23, 2026

Open Access

TL;DR

MEDAL is an automated multi-agent framework for building multilingual benchmarks that test how well LLMs judge open-domain dialogue. It shows that current LLM evaluators struggle to detect nuanced dialogue issues.

Contribution

The paper presents MEDAL, a novel automated multi-agent framework that creates diverse multilingual dialogue benchmarks and assesses LLM evaluators' effectiveness.

Findings

01 State-of-the-art judges struggle with nuanced issues like empathy and relevance.

02 MEDAL uncovers cross-lingual performance differences in chatbot evaluation.

03 The benchmark exposes current LLM evaluators' limited ability to detect nuanced quality issues.

Abstract

Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This…
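The generation and meta-evaluation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Turn`, `Dialogue`, `generate_dialogue`, and `agreement_rate` names are hypothetical, the agents are plain Python callables standing in for LLM calls, and simple label agreement is an assumed stand-in for whatever metric the paper actually uses to compare LLM judges with human annotations.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# NOTE: every name here is hypothetical and chosen for illustration only.


@dataclass
class Turn:
    speaker: str  # "user" or "chatbot"
    text: str


@dataclass
class Dialogue:
    language: str
    seed_context: str
    turns: List[Turn] = field(default_factory=list)


# An agent maps (language, seed context, history so far) to its next utterance.
# In a real pipeline this would wrap an LLM call; here it is any callable.
Agent = Callable[[str, str, List[Turn]], str]


def generate_dialogue(user: Agent, chatbot: Agent, language: str,
                      seed_context: str, num_exchanges: int = 3) -> Dialogue:
    """Alternate a user agent and a chatbot agent, conditioned on a seed context."""
    dialogue = Dialogue(language=language, seed_context=seed_context)
    for _ in range(num_exchanges):
        user_text = user(language, seed_context, dialogue.turns)
        dialogue.turns.append(Turn("user", user_text))
        bot_text = chatbot(language, seed_context, dialogue.turns)
        dialogue.turns.append(Turn("chatbot", bot_text))
    return dialogue


def agreement_rate(judge_labels: Sequence[str],
                   human_labels: Sequence[str]) -> float:
    """Fraction of items where an LLM judge's label matches the human label.

    Assumed metric for illustration; the paper may use a different one.
    """
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Deterministic stub agents so the sketch runs without any model access.
def stub_user(language: str, seed: str, history: List[Turn]) -> str:
    return f"[{language}] user turn {len(history) // 2 + 1} about {seed}"


def stub_chatbot(language: str, seed: str, history: List[Turn]) -> str:
    return f"[{language}] reply to: {history[-1].text}"
```

Calling `generate_dialogue(stub_user, stub_chatbot, "pt", "travel", num_exchanges=2)` yields a four-turn dialogue with alternating speakers. Swapping the stubs for different LLM-backed callables is what would make the setup multi-agent.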

Code & Models

Repositories

johndmendonca/medal
Official

Datasets

Johndfm/medal
dataset · 3 dl

Videos

MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

Taxonomy

Topics: Natural Language Processing Techniques · Interpreting and Communication in Healthcare · Linguistics and Terminology Studies