Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
Wael Hafez, Amir Nazeri

TL;DR
This paper introduces Bipredictability, a token-based metric for monitoring conversational consistency in multi-turn LLM interactions, demonstrating its effectiveness in detecting structural issues without relying on embeddings or model internals.
Contribution
It proposes Bipredictability and the Information Digital Twin as lightweight tools for real-time structural consistency monitoring in multi-turn LLM conversations.
Findings
Bipredictability aligns with structural consistency in 85% of conditions.
The IDT detects all tested contradictions, topic shifts, and non-sequiturs (100% sensitivity).
Structural monitoring complements semantic evaluation in LLM deployment.
Abstract
Large language models (LLMs) are increasingly deployed in multi-turn settings where earlier responses shape later ones, making reliability dependent on whether a conversation remains consistent over time. When this consistency degrades undetected, downstream decisions lose their grounding in the exchange that produced them. Yet current evaluation methods assess isolated outputs rather than the interaction that produces them. Here we show that conversational structural consistency can be monitored directly from token frequency statistics, without embeddings, auxiliary evaluators, or access to model internals. We formalize this signal as Bipredictability (P), which measures shared predictability across the context-response-next-prompt loop relative to the turn's total uncertainty, and implement it in a lightweight auxiliary architecture, the Information Digital Twin (IDT). Across 4,574…
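The abstract describes the metric only in words, so the following is a rough illustration of the general idea of "shared predictability relative to total uncertainty" computed from token counts alone. It is not the paper's definition of Bipredictability. The function names, the whitespace tokenizer, the positional pairing of tokens, and the normalization I(X;Y) / (H(X) + H(Y)) are all assumptions made for this sketch. Under these assumptions the score lies between 0 (no shared structure) and 0.5 (each sequence fully determines the other).

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def shared_predictability(xs, ys):
    """Illustrative ratio I(X;Y) / (H(X) + H(Y)) from token frequencies.

    Assumption: tokens are paired by position after truncating both
    sequences to the shorter length. This pairing is a placeholder,
    not the procedure used in the paper.
    """
    n = min(len(xs), len(ys))
    if n == 0:
        return 0.0
    xs, ys = xs[:n], ys[:n]
    h_x = entropy(Counter(xs))
    h_y = entropy(Counter(ys))
    h_xy = entropy(Counter(zip(xs, ys)))  # joint entropy of paired tokens
    total = h_x + h_y
    if total == 0:
        return 0.0  # both sequences are constant: nothing to predict
    mutual_information = h_x + h_y - h_xy
    # Clamp tiny negative values caused by floating-point rounding.
    return max(mutual_information, 0.0) / total


# Identical sequences share all of their information: score 0.5.
same = shared_predictability(list("abab"), list("abab"))
# Statistically independent pairings share none: score 0.0.
independent = shared_predictability(list("abab"), list("ccdd"))
```

A monitor in this spirit would compute such a score for each turn and flag turns where it departs sharply from the conversation's running baseline. The choice of baseline and threshold is left open here because the source does not specify them.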
