LLMs Get Lost In Multi-Turn Conversation

Philippe Laban; Hiroaki Hayashi; Yingbo Zhou; Jennifer Neville

arXiv:2505.06120·cs.CL·May 12, 2025·5 cites

LLMs Get Lost In Multi-Turn Conversation

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville

PDF

Open Access 1 Repo 3 Reviews

TL;DR

This paper reveals that large language models perform significantly worse in multi-turn conversations compared to single-turn settings, often getting lost and failing to recover from early mistakes, which impacts their usefulness in interactive tasks.

Contribution

The study provides large-scale empirical evidence showing performance drops in multi-turn LLM conversations and analyzes the causes of unreliability and assumption-driven errors.

Findings

01

Average 39% performance drop in multi-turn settings

02

Performance degradation due to unreliability and assumption errors

03

LLMs often get lost and fail to recover in extended conversations

Abstract

Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes…

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01Rating 8Confidence 3

Strengths

### Strength #1: Novel Decomposition of Performance into Aptitude and Reliability The paper makes an important conceptual contribution by decomposing overall performance degradation into two distinct components: aptitude (best-case capability, measured as 90th percentile performance, A90) and unreliability (variance across runs, measured as the 90-10 interpercentile range, U90/10). This framework reveals that the primary issue in multi-turn settings is not loss of capability but rather dramatic

Weaknesses

### Weakness #1: Limited Real-World Validation Restricts Generalizability The paper’s central claim---that LLMs “get lost in multi-turn conversation”---is supported entirely through synthetic simulations in which single-turn benchmark instructions are artificially fragmented into minimal “shards” (typically 6–8 small facts revealed one per turn). While this setup is useful for controlled stress-testing, the authors do not demonstrate that real human–LLM conversations exhibit similar fragmentati

Reviewer 02Rating 8Confidence 4

Strengths

The paper's significance lies in highlighting the critical gap between LLM benchmarks (single-turn) and real-world use (multi-turn). Its novel methodology, "Sharded Simulation," provides a scalable and clever method to adapt existing benchmarks for multi-turn context evaluation. The robust experimentation, consisting of large-scale tests (15 LLMs, 6 tasks), provides strong evidence for the findings. The insightful analysis into "Aptitude" vs. "Unreliability" decomposition is a key insight, pi

Weaknesses

The user simulation is a simplification of real, messy human interaction. The reliance on the simulation might weaken the research scope. Findings are based on analytical tasks; generalizability to creative or open-ended tasks is unclear, and should be considered in future works.

Reviewer 03Rating 10Confidence 5

Strengths

-A very clean experimental protocol: the sharded simulation is carefully constructed and well validated, and the suite of simulation modes (full, sharded, concat, recap, and snowball) is well designed to isolate where and why LLMs get lost in multi-turn conversations. Remarkably also, the authors repeat simulations for each instruction and quantify the resulting variability. -Large-scale evaluation across six tasks (code, databases, actions, math) with broad model coverage—from open-source to f

Weaknesses

-It's a pity the root causes of conversational model failures are buried in the appendix. -I’m not fully convinced by the loss-of-middle-turns phenomenon (described in Appendix §F.3). This would probably require a deeper and per-model analysis.

Code & Models

Repositories

microsoft/lost_in_conversation
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Natural Language Processing Techniques · Text Readability and Simplification