Reasoning Gets Harder for LLMs Inside A Dialogue
Ivan Kartáč, Mateusz Lango, Ondřej Dušek

TL;DR
This paper examines how the reasoning performance of large language models deteriorates in dialogue-based settings compared to isolated tasks, highlighting the importance of realistic interactive evaluations.
Contribution
Introduces BOULDER, a new benchmark for reasoning in task-oriented dialogue, and shows that LLMs perform substantially worse when the same problems are posed within multi-turn interactions.
Findings
Performance drops significantly in dialogue-based reasoning tasks.
The multi-turn nature of dialogue accounts for most of the performance gap.
Role conditioning and tool use further affect reasoning accuracy.
Abstract
Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight…
