The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
Alex Bogdan, Adrian de Valois-Franklin

TL;DR
This paper uncovers a universal statistical pattern in large language model outputs that enables a fast, model-agnostic verification primitive for assessing and authenticating model outputs in real time.
Contribution
It identifies a statistical regularity in LLM outputs and builds on it a rapid, model-agnostic scoring primitive for provenance verification and output assessment.
Findings
Token rank-frequency distributions follow a Mandelbrot distribution across models and domains.
The primitive runs CPU-only at 2.6 microseconds per token, an estimated latency reduction of up to 100,000× relative to existing sampling-based detectors.
Fitted Mandelbrot parameters effectively distinguish between different models.
Abstract
We report a striking statistical regularity in frontier LLM outputs that enables a CPU-only scoring primitive running at 2.6 microseconds per token, with estimated latency up to 100,000× (five orders of magnitude) below existing sampling-based detectors. Across six contemporary models from five independent vendors, two generation sizes, and five held-out domains, token rank-frequency distributions converge to the same two-parameter Mandelbrot ranking distribution, with 34 of 36 model-by-domain fits exceeding and 35 of 36 favoring Mandelbrot over Zipf by AIC. The shared family does not collapse the models into statistical duplicates. Fitted Mandelbrot parameters remain cleanly separable between models: the cross-model spread in (1.63 to 3.69) exceeds its per-model bootstrap standard deviation (0.03 to 0.10) by more than an order of magnitude, yielding tens of…
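To make the Mandelbrot-versus-Zipf comparison concrete, here is a minimal sketch of how such a fit might be set up. It is not the authors' procedure, which the abstract does not specify. It assumes the common parameterization p(r) ∝ (r + q)^(-s) over ranks r = 1..N, fits by maximum likelihood with a coarse grid search, and treats Zipf as the special case q = 0. The names `s`, `q`, `rank_counts`, `fit`, and `compare`, and the grid ranges, are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def rank_counts(tokens):
    """Token frequencies sorted from most to least common (rank 1 first)."""
    return [c for _, c in Counter(tokens).most_common()]


def log_likelihood(counts, s, q):
    """Log-likelihood of rank counts under p(r) proportional to (r + q)^(-s)."""
    logs = [math.log(r + q) for r in range(1, len(counts) + 1)]
    log_z = math.log(sum(math.exp(-s * lg) for lg in logs))  # normalizer
    n = sum(counts)
    return -s * sum(c * lg for c, lg in zip(counts, logs)) - n * log_z


def frange(start, stop, step):
    """Inclusive list of evenly spaced floats."""
    k = int(round((stop - start) / step))
    return [start + i * step for i in range(k + 1)]


def fit(counts, s_grid, q_grid):
    """Grid-search maximum likelihood; returns (loglik, s, q)."""
    return max((log_likelihood(counts, s, q), s, q)
               for s in s_grid for q in q_grid)


def aic(loglik, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * loglik


def compare(counts):
    """Fit Mandelbrot (s, q free) and Zipf (q fixed at 0); report AIC for each."""
    s_grid = frange(0.5, 4.0, 0.05)   # assumed search range for the exponent
    q_grid = frange(0.0, 10.0, 0.25)  # assumed search range for the shift
    ll_m, s_m, q_m = fit(counts, s_grid, q_grid)
    ll_z, s_z, _ = fit(counts, s_grid, [0.0])
    return {
        "mandelbrot": {"s": s_m, "q": q_m, "aic": aic(ll_m, 2)},
        "zipf": {"s": s_z, "aic": aic(ll_z, 1)},
    }


def synthetic_counts(s, q, n_ranks, n_tokens):
    """Expected (rounded) counts from a known Mandelbrot law, for a sanity check."""
    w = [(r + q) ** (-s) for r in range(1, n_ranks + 1)]
    z = sum(w)
    return [round(n_tokens * x / z) for x in w]


if __name__ == "__main__":
    counts = synthetic_counts(s=2.0, q=3.0, n_ranks=200, n_tokens=1_000_000)
    result = compare(counts)
    print(result)  # the lower AIC marks the preferred model
```

On real data, `rank_counts` would be applied to a tokenized model output and the result passed to `compare`. A grid search is used here only to keep the sketch dependency-free; a numerical optimizer would be the more usual choice for a production fit.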
