LLMs Get Lost In Multi-Turn Conversation
Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville

TL;DR
This paper reveals that large language models perform significantly worse in multi-turn conversations compared to single-turn settings, often getting lost and failing to recover from early mistakes, which impacts their usefulness in interactive tasks.
Contribution
The study provides large-scale empirical evidence showing performance drops in multi-turn LLM conversations and analyzes the causes of unreliability and assumption-driven errors.
Findings
Average 39% performance drop in multi-turn settings
Performance degradation due to unreliability and assumption errors
LLMs often get lost and fail to recover in extended conversations
Abstract
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes…
Peer Reviews
Decision·ICLR 2026 Oral
### Strength #1: Novel Decomposition of Performance into Aptitude and Reliability The paper makes an important conceptual contribution by decomposing overall performance degradation into two distinct components: aptitude (best-case capability, measured as 90th percentile performance, A90) and unreliability (variance across runs, measured as the 90-10 interpercentile range, U90/10). This framework reveals that the primary issue in multi-turn settings is not loss of capability but rather dramatic
### Weakness #1: Limited Real-World Validation Restricts Generalizability The paper’s central claim---that LLMs “get lost in multi-turn conversation”---is supported entirely through synthetic simulations in which single-turn benchmark instructions are artificially fragmented into minimal “shards” (typically 6–8 small facts revealed one per turn). While this setup is useful for controlled stress-testing, the authors do not demonstrate that real human–LLM conversations exhibit similar fragmentati
The paper's significance lies in highlighting the critical gap between LLM benchmarks (single-turn) and real-world use (multi-turn). Its novel methodology, "Sharded Simulation," provides a scalable and clever method to adapt existing benchmarks for multi-turn context evaluation. The robust experimentation, consisting of large-scale tests (15 LLMs, 6 tasks), provides strong evidence for the findings. The insightful analysis into "Aptitude" vs. "Unreliability" decomposition is a key insight, pi
The user simulation is a simplification of real, messy human interaction. The reliance on the simulation might weaken the research scope. Findings are based on analytical tasks; generalizability to creative or open-ended tasks is unclear, and should be considered in future works.
-A very clean experimental protocol: the sharded simulation is carefully constructed and well validated, and the suite of simulation modes (full, sharded, concat, recap, and snowball) is well designed to isolate where and why LLMs get lost in multi-turn conversations. Remarkably also, the authors repeat simulations for each instruction and quantify the resulting variability. -Large-scale evaluation across six tasks (code, databases, actions, math) with broad model coverage—from open-source to f
-It's a pity the root causes of conversational model failures are buried in the appendix. -I’m not fully convinced by the loss-of-middle-turns phenomenon (described in Appendix §F.3). This would probably require a deeper and per-model analysis.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
