The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious
Ferdinand M. Schessl

TL;DR
This paper shows that significance tests on turn-level metrics in LLM conversation analysis are often unreliable because consecutive turns are autocorrelated. It proposes a correction framework and documents that the issue is widely neglected.
Contribution
It systematically characterizes autocorrelation in turn-level metrics and introduces a two-stage correction method to improve statistical validity in LLM conversation evaluation.
Findings
42% of associations that appear significant under pooled testing fail to survive cluster-robust correction
The proposed correction method improves replication from 30% to 57%
Most recent studies neglect autocorrelation correction in turn-level metrics
Abstract
Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent -- a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations (11,639 turn pairs, 5 German-speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster-robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional,…
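To make the pooled-versus-clustered contrast concrete, here is a minimal simulation sketch. It is not the paper's two-stage correction method, which this excerpt does not describe. It assumes two turn-level metrics that are truly unrelated but each follow an AR(1) process within a conversation, and compares a naive OLS slope test against a basic cluster-robust (CR0 sandwich) test that groups turns by conversation. All function names, the AR(1) coefficient, and the sample sizes are hypothetical choices for illustration.

```python
import math
import random


def ar1_series(n, rho, rng):
    """Return n draws from a stationary AR(1) process with unit innovation variance."""
    value = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - rho ** 2))
    series = []
    for _ in range(n):
        series.append(value)
        value = rho * value + rng.gauss(0.0, 1.0)
    return series


def slope_t_stats(x, y, clusters):
    """OLS slope of y on x, returned as (naive t, cluster-robust CR0 t)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xc = [xi - x_bar for xi in x]
    sxx = sum(v * v for v in xc)
    slope = sum(v * (yi - y_bar) for v, yi in zip(xc, y)) / sxx
    resid = [(yi - y_bar) - slope * v for v, yi in zip(xc, y)]

    # Naive SE: treats every turn as an independent observation.
    naive_se = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)

    # Cluster-robust SE: sum the score (centered x times residual)
    # within each conversation before squaring.
    scores = {}
    for v, e, g in zip(xc, resid, clusters):
        scores[g] = scores.get(g, 0.0) + v * e
    cluster_se = math.sqrt(sum(s * s for s in scores.values())) / sxx

    return slope / naive_se, slope / cluster_se


def false_positive_rates(n_sims=200, n_convs=40, n_turns=50, rho=0.9, seed=0):
    """Share of simulations where each test rejects a true null at |t| > 1.96."""
    rng = random.Random(seed)
    naive_hits = cluster_hits = 0
    for _ in range(n_sims):
        x, y, clusters = [], [], []
        for conv in range(n_convs):
            x += ar1_series(n_turns, rho, rng)
            y += ar1_series(n_turns, rho, rng)  # generated independently of x
            clusters += [conv] * n_turns
        t_naive, t_cluster = slope_t_stats(x, y, clusters)
        naive_hits += abs(t_naive) > 1.96
        cluster_hits += abs(t_cluster) > 1.96
    return naive_hits / n_sims, cluster_hits / n_sims


if __name__ == "__main__":
    naive_rate, cluster_rate = false_positive_rates()
    print(f"naive false-positive rate:   {naive_rate:.2f}")
    print(f"cluster false-positive rate: {cluster_rate:.2f}")
```

Because the two simulated metrics are independent by construction, any rejection is a false positive. Under these assumptions the naive test should reject far more often than the nominal 5%, while the clustered test should stay much closer to it. Real pipelines would more likely rely on an established implementation of cluster-robust inference than on this hand-rolled version.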
