TL;DR
This paper introduces correlation dimension, a fractal measure to quantify the structural complexity of language models, revealing phases of training, hallucination tendencies, and degeneration in generated text.
Contribution
It proposes a novel fractal-geometric metric for analyzing LLMs, capturing hierarchical recurrence and long-range structure beyond traditional accuracy metrics.
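To make "correlation dimension" concrete, here is a minimal sketch of the classical Grassberger-Procaccia estimator applied to a generic point cloud. This summary does not say how the paper derives its points from a language model, so the inputs and the function names `correlation_integral` and `correlation_dimension` are illustrative assumptions, not the authors' implementation. The idea is to count the fraction of point pairs closer than a radius r, then read the dimension off the slope of log C(r) against log r.

```python
import numpy as np


def correlation_integral(points, r):
    """Fraction of distinct point pairs whose Euclidean distance is below r."""
    n = len(points)
    # Pairwise differences via broadcasting: shape (n, n, d).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep each unordered pair once (strict upper triangle).
    upper = np.triu_indices(n, k=1)
    return np.mean(dists[upper] < r)


def correlation_dimension(points, radii):
    """Slope of log C(r) versus log r over the given radii (least squares)."""
    radii = np.asarray(radii, dtype=float)
    c = np.array([correlation_integral(points, r) for r in radii])
    # Drop radii with no pairs inside, where log C(r) is undefined.
    mask = c > 0
    slope, _intercept = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
    return slope


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    radii = np.logspace(-1.5, -0.7, 8)
    # Points on a line segment embedded in 3-D: the estimate should be near 1.
    t = rng.random(1000)
    line = np.stack([t, t, t], axis=1) / np.sqrt(3)
    print("line segment:", correlation_dimension(line, radii))
    # Points filling a unit square: the estimate should be near 2.
    square = rng.random((1000, 2))
    print("unit square:", correlation_dimension(square, radii))
```

On finite samples the estimate depends on the chosen range of radii and is biased slightly low by boundary effects, so it is best read as an approximate scaling exponent rather than an exact value.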
Findings
Correlation dimension reveals three distinct phases during pretraining.
It reflects context-dependent complexity.
It indicates a model's tendency toward hallucination and detects degeneration in generated text.
Abstract
Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors -- such as repetition and incoherence -- even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model. This measure captures the hierarchical recurrence structure of language, bridging local and global properties in a unified framework. Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward…