Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ

Karl Neergaard; Le Qiu; Emmanuele Chersoni

arXiv:2601.16508·cs.CL·January 26, 2026

Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ

Karl Neergaard, Le Qiu, Emmanuele Chersoni

PDF

Open Access

TL;DR

This paper investigates how conversation length impacts the accuracy of large language models in multi-turn dialogues, revealing vulnerabilities not seen in single-turn evaluations and highlighting limitations of static benchmarking methods.

Contribution

It introduces a multi-turn evaluation framework for LLMs, demonstrating that conversation length influences response accuracy and exposes model-specific weaknesses.

Findings

01

Longer conversations can decrease LLM accuracy

02

Vulnerabilities vary across different models

03

Single-turn tests may miss critical weaknesses

Abstract

Single-prompt evaluations dominate current LLM benchmarking, yet they fail to capture the conversational dynamics where real-world harm occurs. In this study, we examined whether conversation length affects response veracity by evaluating LLM performance on the BoolQ dataset under varying length and scaffolding conditions. Our results across three distinct LLMs revealed model-specific vulnerabilities that are invisible under single-turn testing. The length-dependent and scaffold-specific effects we observed demonstrate a fundamental limitation of static evaluations, as deployment-relevant vulnerabilities could only be spotted in a multi-turn conversational setting.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques