Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
Jiyoon Myung

TL;DR
This paper systematically evaluates the reliability of large language models in multi-turn conversations, revealing significant performance declines and failure modes that impact trustworthy deployment.
Contribution
It introduces an evaluation framework that pairs single-turn and multi-turn versions of three practical tasks to measure the conversational reliability of LLMs, and it identifies recurring failure modes.
Findings
Reliability declines significantly in multi-turn settings, especially for smaller models.
Common failure modes include instruction drift, intent confusion, and contextual overwriting.
The study emphasizes the importance of stress-testing LLMs for conversational robustness.
Abstract
Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes…
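The paired single-turn and multi-turn design described in the abstract can be illustrated with a minimal sketch. It assumes reliability is scored as per-task accuracy and degradation as the absolute accuracy drop between the two settings; the abstract does not state the paper's actual metric, and the names PairedResult, accuracy, and reliability_drop are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """One test item evaluated in both settings (hypothetical record layout)."""
    task: str
    single_turn_correct: bool
    multi_turn_correct: bool


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def reliability_drop(results):
    """Map each task to (single-turn accuracy, multi-turn accuracy, absolute drop).

    Assumption: degradation is the simple difference in accuracy. The paper
    may define reliability differently.
    """
    by_task = {}
    for r in results:
        by_task.setdefault(r.task, []).append(r)

    summary = {}
    for task, rows in by_task.items():
        single = accuracy(r.single_turn_correct for r in rows)
        multi = accuracy(r.multi_turn_correct for r in rows)
        summary[task] = (single, multi, single - multi)
    return summary
```

Calling reliability_drop on a list of PairedResult records returns one tuple per task, so a larger third value indicates a larger loss of reliability when the same item is embedded in an extended dialogue.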
Taxonomy
Topics: Topic Modeling · Software System Performance and Reliability · Adversarial Robustness in Machine Learning
