Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations
Ike Ebubechukwu, Johane Takeuchi, Antonello Ceravola, Frank Joublin

TL;DR
This paper compares human and AI evaluations of dialogue quality, showing that GPT models closely match human judgments on criteria such as coherence and goal contribution, while challenges remain around redundancy and contradictions.
Contribution
It provides a comprehensive analysis of GPT-4o's performance in dialogue evaluation, highlighting its strengths and limitations compared to human assessments.
Findings
GPT models align closely with human judgments in coherence and goal contribution
GPT-4o performs well in factual accuracy and commonsense reasoning
Challenges remain in reducing redundancy and self-contradiction in AI evaluations
Abstract
As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these…
Taxonomy
Topics: AI in Service Interactions · Ethics and Social Impacts of AI
Methods: Cosine Annealing · Linear Layer · Residual Connection · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning · Multi-Head Attention · Byte Pair Encoding · Softmax
