The Statistical Signature of LLMs
Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi

TL;DR
This paper demonstrates that lossless compression can serve as a model-agnostic measure to identify structural signatures of probabilistic language generation by large language models across various contexts.
Contribution
It introduces a novel, surface-text-based framework using compression to quantify how LLMs alter the statistical structure of language across different environments.
Findings
In controlled and mediated contexts, LLMs produce more regular and compressible text than humans.
Compression reveals a persistent structural signature of probabilistic generation.
The signature's scale dependence limits surface-level distinguishability at small scales.
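One generic reason compression-based signals weaken on short texts can be sketched as follows. This is an illustration under stated assumptions, not the paper's analysis: it uses Python's standard `zlib` as the compressor, and the names `ratio`, `ratio_by_prefix`, and `sample` are hypothetical. On very short inputs the compressor's fixed overhead dominates, so the compressed-to-raw ratio says little about the text's regularity.

```python
import zlib

def ratio(data: bytes) -> float:
    """Compressed size divided by raw size (lower means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)

def ratio_by_prefix(text: str, sizes: list[int]) -> dict[int, float]:
    """Compression ratio of the first n bytes of `text`, for each n in `sizes`.

    Prefixes are cut at byte offsets, which is only safe for ASCII text
    (assumed here); multi-byte characters could be split.
    """
    raw = text.encode("utf-8")
    return {n: ratio(raw[:n]) for n in sizes if 0 < n <= len(raw)}

# Toy, highly repetitive sample (hypothetical data, for illustration only).
sample = "compression exposes repeated structure in text. " * 50
profile = ratio_by_prefix(sample, [8, 64, 512, 2048])

for n, r in profile.items():
    print(f"{n:>5} bytes -> ratio {r:.3f}")
```

For the 8-byte prefix the ratio exceeds 1 because the zlib header and checksum outweigh any savings; the repetition only becomes visible as the prefix grows.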
Abstract
Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility…
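A minimal sketch of the kind of measurement the abstract describes, assuming Python's standard `zlib` as the lossless compressor (the abstract does not say which compressor the authors use). The function `compression_ratio` and both toy strings are hypothetical and serve only to show that more regular text yields a lower ratio.

```python
import random
import string
import zlib

def compression_ratio(text: str, level: int = 9) -> float:
    """Compressed size over raw size, in bytes; lower means more regular text."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty text")
    return len(zlib.compress(raw, level)) / len(raw)

# Toy inputs (not data from the paper): one highly regular string and one
# pseudo-random string of the same length, seeded for reproducibility.
regular = "the model generates text. " * 40
rng = random.Random(0)
irregular = "".join(rng.choice(string.ascii_lowercase + " ") for _ in range(len(regular)))

print(f"regular:   {compression_ratio(regular):.3f}")
print(f"irregular: {compression_ratio(irregular):.3f}")
```

Because the measure needs only the surface text and a general-purpose compressor, it can be applied without access to any model's internals, which is what makes it model-agnostic in the sense used above.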
Taxonomy
Topics: Language and cultural evolution · Text Readability and Simplification · Authorship Attribution and Profiling
