Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations

Ike Ebubechukwu; Johane Takeuchi; Antonello Ceravola; Frank Joublin

arXiv:2409.01808·cs.CL·September 11, 2024

TL;DR

This paper compares human and AI evaluations of dialogue quality across seven key performance indicators. GPT models closely match human judgments on Coherence, Innovation, Concreteness, and Goal Contribution, but still face challenges in reducing redundancy and contradictions.

Contribution

It provides a comprehensive analysis of GPT-4o's performance in dialogue evaluation, highlighting its strengths and limitations compared to human assessments.

Findings

01. GPT models align closely with human judgments in coherence and goal contribution

02. GPT-4o performs well in factual accuracy and commonsense reasoning

03. Challenges remain in reducing redundancy and self-contradiction in AI evaluations

Abstract

As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these…

Taxonomy

Topics: AI in Service Interactions · Ethics and Social Impacts of AI

Methods: Cosine Annealing · Linear Layer · Residual Connection · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning · Multi-Head Attention · Byte Pair Encoding · Softmax