The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious

Ferdinand M. Schessl

arXiv:2604.14414·cs.CL·April 17, 2026

TL;DR

This paper shows that significance tests on turn-level metrics in LLM conversation analysis are often unreliable because consecutive turns are autocorrelated. It proposes a correction framework and notes that most recent studies neglect the issue.

Contribution

It systematically characterizes autocorrelation in turn-level metrics and introduces a two-stage correction method to improve statistical validity in LLM conversation evaluation.
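
As an illustration of what characterizing autocorrelation in a turn-level metric can look like, the sketch below estimates the lag-1 autocorrelation of a metric within each conversation and averages it across conversations. This is a minimal sketch under stated assumptions, not the paper's procedure: the function names and the choice of a lag-1 estimator are hypothetical, and the paper may use a different estimator or additional lags.

```python
from statistics import mean


def lag1_autocorr(series):
    """Lag-1 autocorrelation of one conversation's metric values.

    Hypothetical helper: returns None when the series is too short
    or constant, since the estimate is undefined in those cases.
    """
    n = len(series)
    if n < 3:
        return None
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        return None
    # Covariance between each turn and the next, within one conversation only.
    num = sum((series[t] - m) * (series[t + 1] - m) for t in range(n - 1))
    return num / denom


def mean_within_conversation_autocorr(conversations):
    """Average the per-conversation estimates, skipping undefined ones."""
    vals = [r for r in (lag1_autocorr(c) for c in conversations) if r is not None]
    return mean(vals) if vals else None
```

The estimate is computed per conversation rather than on the pooled turns, so that differences between conversations are not mistaken for turn-to-turn dependence.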

Findings

01. 42% of associations that appear significant under naive pooled testing fail to survive cluster-robust correction.

02. The proposed correction method improves replication from 30% to 57%.

03. Most recent studies neglect autocorrelation correction for turn-level metrics.

Abstract

Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent -- a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations (11,639 turn pairs, 5 German-speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster-robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional,…
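
The contrast between naive pooled testing and cluster-robust correction can be sketched for the simplest case, a regression of one turn-level metric on another with the conversation as the cluster. This is a hedged illustration, not the paper's pipeline: the function name is hypothetical, the estimator is a standard cluster-robust sandwich variance with a common small-sample factor, and the excerpt above does not specify which variant the author used.

```python
import math
from collections import defaultdict


def slope_with_naive_and_cluster_se(x, y, groups):
    """OLS slope of y on x with a naive and a cluster-robust standard error.

    x, y   : per-turn metric values pooled across conversations
    groups : conversation id for each turn (the clustering variable)
    Assumes at least two clusters and more than two observations.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Naive variance: treats every turn as an independent observation.
    naive_var = (sum(e * e for e in resid) / (n - 2)) / sxx

    # Cluster-robust variance: sum the score (x - x_bar) * residual within
    # each conversation first, so within-conversation dependence is retained.
    scores = defaultdict(float)
    for d, e, g in zip(dx, resid, groups):
        scores[g] += d * e
    n_groups = len(scores)
    cr0_var = sum(s * s for s in scores.values()) / (sxx * sxx)
    # Small-sample factor (an assumption here; other conventions exist).
    correction = (n_groups / (n_groups - 1)) * ((n - 1) / (n - 2))
    cluster_var = correction * cr0_var

    return slope, math.sqrt(naive_var), math.sqrt(cluster_var)
```

When turns within a conversation are strongly dependent, the cluster-robust standard error is typically larger than the naive one, so an association that clears a significance threshold under pooled testing may no longer clear it after correction.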
