On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation
John Mendonça, Alon Lavie, Isabel Trancoso

TL;DR
This paper critically examines current benchmarks for open-domain dialogue evaluation, showing that they rely on outdated datasets and quality aspects, fail to accurately assess modern chatbot capabilities, and that LLM evaluators struggle to detect deficiencies in recent chatbot dialogues.
Contribution
It highlights the limitations of existing evaluation benchmarks and provides empirical evidence that current LLM evaluators struggle to identify issues in recent chatbot dialogues.
Findings
Existing benchmarks rely on outdated datasets, older response generators, and quality aspects such as Fluency and Relevance.
LLM evaluators like GPT-4 have difficulty detecting deficiencies in recent chatbot dialogues.
Current evaluation methods do not fully reflect modern chatbot capabilities.
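One way to make the second finding concrete is to measure how many human-flagged deficiencies an LLM evaluator also flags. The sketch below is illustrative only: the function name `detection_recall`, the boolean label format, and the toy labels are assumptions for this example, not the paper's protocol or data.

```python
from typing import Sequence


def detection_recall(
    human_flags: Sequence[bool], evaluator_flags: Sequence[bool]
) -> float:
    """Fraction of human-flagged deficient responses that the evaluator also flags."""
    if len(human_flags) != len(evaluator_flags):
        raise ValueError("label lists must have the same length")
    # Keep only the evaluator's verdicts on responses a human marked deficient.
    on_deficient = [e for h, e in zip(human_flags, evaluator_flags) if h]
    if not on_deficient:
        raise ValueError("no human-flagged deficiencies to score against")
    return sum(on_deficient) / len(on_deficient)


# Toy labels invented for illustration (hypothetical, not results from the paper).
human = [True, False, True, True, False, True]
llm = [True, False, False, False, False, True]

recall = detection_recall(human, llm)
print(f"deficiency detection recall: {recall:.2f}")  # prints 0.50
```

A low value on a measure like this would mean the evaluator misses many problems that annotators see, which is the kind of gap the finding describes.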
Abstract
Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks, and together with human evaluation, compose the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fails to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in…
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Adam · Dropout
