Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

Jonathan Ivey; Shivani Kumar; Jiayu Liu; Hua Shen; Sushrita Rakshit; Rohan Raju; Haotian Zhang; Aparna Ananthasubramaniam; Junghwan Kim; Bowen Yi; Dustin Wright; Abraham Israeli; Anders Giovanni Møller; Lechen Zhang; David Jurgens

arXiv:2409.08330·cs.CL·September 18, 2024·2 cites

Open Access · 1 Repo

TL;DR

This study evaluates whether large language models can accurately simulate qualities of human dialogue by comparing paired LLM-LLM and human-LLM interactions. It finds substantial divergence between simulated and human responses, with similar model performance across English, Chinese, and Russian.

Contribution

The paper provides a large-scale analysis of 100,000 paired LLM-LLM and human-LLM dialogues drawn from WildChat, documenting how LLM simulations diverge from human responses in style and content.

Findings

01 Low alignment between LLM simulations and human dialogues

02 Models perform similarly across English, Chinese, and Russian

03 LLMs simulate human responses better when the human's writing style is similar to the LLM's own

Abstract

Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what extent do LLM-based simulations actually reflect human dialogues? In this work, we answer this question by generating a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset and quantifying how well the LLM simulations align with their human counterparts. Overall, we find relatively low alignment between simulations and human interactions, demonstrating a systematic divergence along multiple textual properties, including style and content.…
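
To make the idea of "alignment" between paired turns concrete, the following is a minimal sketch that scores each simulated response against its human counterpart and averages over pairs. It is an illustration only: word-level Jaccard overlap is a stand-in chosen for simplicity, not a metric taken from the paper, and the function names `jaccard` and `mean_alignment` are hypothetical.

```python
import re


def tokens(text):
    """Lowercased set of word tokens (a crude tokenizer, assumed for illustration)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(human_turn, simulated_turn):
    """Word-overlap similarity in [0, 1] between a human turn and a simulated turn."""
    a, b = tokens(human_turn), tokens(simulated_turn)
    if not a and not b:
        return 1.0  # two empty turns are treated as identical
    return len(a & b) / len(a | b)


def mean_alignment(pairs):
    """Average similarity over (human_turn, simulated_turn) pairs; 0.0 if empty."""
    scores = [jaccard(h, s) for h, s in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical paired turns: (what the human wrote, what the LLM simulated).
example_pairs = [
    ("thanks that works", "Thanks, that works!"),
    ("make it shorter", "Could you please elaborate further?"),
]
score = mean_alignment(example_pairs)
```

A single lexical score like this captures only surface overlap. The abstract refers to multiple textual properties, including style and content, so a fuller comparison would need one measure per property, and the regex tokenizer here is a poor fit for unsegmented scripts such as Chinese.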

Code & Models

Repositories

davidjurgens/human-llm-similarity
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Multi-Agent Systems and Negotiation

Methods: ALIGN