Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots
Ekaterina Svikhnushina, Pearl Pu

TL;DR
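This study compares online (first-party) and offline (third-party) evaluation methods for social chatbots. It finds that online assessments better capture the nuances of user experience, and that automated third-party evaluations with GPT-4, given detailed instructions, approximate first-party human judgments more closely than offline human evaluations do.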
Contribution
It systematically compares online and offline chatbot evaluations, demonstrating the limitations of offline methods and proposing automated GPT-4 evaluations as a promising alternative.
Findings
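Offline human evaluations miss the subtleties of human-chatbot interactions
GPT-4-based evaluations with detailed instructions approximate first-party human judgments better than offline human evaluations do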
Online assessments provide richer insights into user experience
Abstract
This paper explores the efficacy of online versus offline evaluation methods in assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. By extending a benchmarking dataset of user dialogs with empathetic chatbots with offline third-party evaluations, we present a systematic comparison between the feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In comparison, automated third-party evaluations using a GPT-4 model offer a better approximation of first-party human judgments given detailed instructions. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates…
Taxonomy
Topics: AI in Service Interactions · Misinformation and Its Impacts · FinTech, Crowdfunding, Digital Finance
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer
