Reasoning Gets Harder for LLMs Inside A Dialogue

Ivan Kartáč; Mateusz Lango; Ondřej Dušek

arXiv:2603.20133·cs.CL·April 30, 2026

TL;DR

This paper examines how the reasoning performance of large language models deteriorates when a task is framed within a dialogue rather than posed in isolation, and argues for evaluating models in realistic interactive settings.

Contribution

Introduces BOULDER, a new benchmark for reasoning in dialogue, and shows a significant performance gap between isolated and dialogue-based variants of the same problems, driven largely by the multi-turn interaction.

Findings

01

Performance drops significantly in dialogue-based reasoning tasks.

02

The multi-turn nature of dialogue accounts for most of the performance gap.

03

Role conditioning and tool use also affect reasoning accuracy.

Abstract

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both an isolated and a dialogue-based variant, enabling controlled comparison while mitigating data contamination. Experiments on eight…
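
The paired design described in the abstract, where the same problem appears once in isolation and once inside a dialogue, lends itself to a per-problem comparison. The following is a minimal sketch of how such a comparison could be scored, not the authors' evaluation code. The `PairedResult` record, the `accuracy_gap` function, and the reported metric names are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """Outcome of one problem under both framings (hypothetical record type)."""
    problem_id: str
    isolated_correct: bool   # model answered the isolated variant correctly
    dialogue_correct: bool   # model answered the dialogue variant correctly


def accuracy_gap(results: list[PairedResult]) -> dict[str, float]:
    """Compare accuracy on isolated vs. dialogue variants of the same problems."""
    if not results:
        raise ValueError("need at least one paired result")
    n = len(results)
    isolated = sum(r.isolated_correct for r in results) / n
    dialogue = sum(r.dialogue_correct for r in results) / n
    # Share of problems solved in isolation but missed inside the dialogue.
    lost = sum(r.isolated_correct and not r.dialogue_correct for r in results) / n
    return {
        "isolated": isolated,
        "dialogue": dialogue,
        "gap": isolated - dialogue,
        "lost_in_dialogue": lost,
    }
```

Because both variants share the same underlying problem, the `lost_in_dialogue` share isolates failures attributable to the framing rather than to problem difficulty, which is the kind of controlled comparison the abstract refers to.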
