Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ
Karl Neergaard, Le Qiu, Emmanuele Chersoni

TL;DR
This paper investigates how conversation length impacts the accuracy of large language models in multi-turn dialogues, revealing vulnerabilities not seen in single-turn evaluations and highlighting limitations of static benchmarking methods.
Contribution
It introduces a multi-turn evaluation framework for LLMs, demonstrating that conversation length influences response accuracy and exposes model-specific weaknesses.
Findings
Longer conversations can decrease LLM accuracy
Vulnerabilities vary across different models
Single-turn tests may miss critical weaknesses
Abstract
Single-prompt evaluations dominate current LLM benchmarking, yet they fail to capture the conversational dynamics where real-world harm occurs. In this study, we examined whether conversation length affects response veracity by evaluating LLM performance on the BoolQ dataset under varying length and scaffolding conditions. Our results across three distinct LLMs revealed model-specific vulnerabilities that are invisible under single-turn testing. The length-dependent and scaffold-specific effects we observed demonstrate a fundamental limitation of static evaluations, as deployment-relevant vulnerabilities could only be spotted in a multi-turn conversational setting.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques
