Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction

Ryo Kamoi; Ameya Godbole; Longqi Yang; Rui Zhang; Mengting Wan; Pei Zhou

arXiv:2603.17094·cs.CL·March 19, 2026

Open Access

TL;DR

This paper introduces CoCoEval, a framework for analyzing inconsistent and uncollaborative behaviors in LLM-simulated conversations, revealing challenges in accurately modeling human social interactions.

Contribution

The work presents CoCoEval, a novel evaluation method for turn-level behavior detection in LLM conversations, and provides insights into the limitations of current prompting and fine-tuning techniques.

Findings

01 LLM-simulated conversations contain fewer inconsistent and uncollaborative behaviors than human conversations.

02 Prompt engineering does not reliably control how often these behaviors appear.

03 Fine-tuning can cause models to overproduce certain behaviors.

Abstract

Simulating human conversations using large language models (LLMs) has emerged as a scalable methodology for modeling human social interaction. However, simulating human conversations is challenging because they inherently involve inconsistent and uncollaborative behaviors, such as misunderstandings and interruptions. Analysis comparing inconsistent and uncollaborative behaviors in human- and LLM-generated conversations remains limited, although reproducing these behaviors is integral to simulating human-like and complex social interaction. In this work, we introduce CoCoEval, an evaluation framework that analyzes LLM-simulated conversations by detecting 10 types of inconsistent and uncollaborative behaviors at the turn level using an LLM-as-a-Judge. Using CoCoEval, we evaluate GPT-4.1, GPT-5.1, and Claude Opus 4 by comparing the frequencies of detected behaviors in conversations…
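
The frequency comparison described in the abstract can be illustrated with a minimal sketch. It assumes the judge has already produced a list of behavior labels for each turn; the function names, the data layout, and the toy conversations below are hypothetical and are not taken from the paper. Only the two labels, misunderstanding and interruption, come from the abstract's examples.

```python
from collections import Counter


def behavior_rates(conversations):
    """Return the fraction of turns in which each behavior label was detected.

    `conversations` is a list of conversations; each conversation is a list of
    turns; each turn is a list of behavior labels (assumed judge output).
    """
    counts = Counter()
    total_turns = 0
    for conversation in conversations:
        for turn_labels in conversation:
            total_turns += 1
            # Count each label at most once per turn.
            counts.update(set(turn_labels))
    if total_turns == 0:
        return {}
    return {label: n / total_turns for label, n in counts.items()}


def rate_gaps(human, simulated):
    """Simulated rate minus human rate for every label seen in either set."""
    h = behavior_rates(human)
    s = behavior_rates(simulated)
    labels = sorted(set(h) | set(s))
    return {label: s.get(label, 0.0) - h.get(label, 0.0) for label in labels}


# Hypothetical toy data: one four-turn conversation per source.
human = [[["interruption"], [], ["misunderstanding"], []]]
simulated = [[[], [], ["misunderstanding"], []]]

gaps = rate_gaps(human, simulated)
for label, gap in gaps.items():
    print(f"{label}: {gap:+.2f}")
```

A negative gap means the behavior is rarer in the simulated conversations than in the human ones, which is the direction reported in the first finding.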

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · AI in Service Interactions