Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya

TL;DR
This study explores the intrinsic dimension of texts, revealing its relationship with genre, linguistic features, and interpretability, and demonstrating how different text properties influence the complexity captured by language models.
Contribution
This is the first comprehensive analysis linking intrinsic dimension (ID) to interpretable text properties, genre stratification, and causal linguistic features, using cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs).
Findings
ID is uncorrelated with entropy-based metrics after controlling for text length
Scientific texts have low ID (~8), while creative and opinion writing has high ID (~10.5), across all models tested
Formal signals decrease ID; humanized signals increase ID, confirmed as causal effects
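The first finding rests on removing the effect of text length before comparing ID with entropy-based metrics. One common way to do this is a partial correlation: regress length out of both quantities and correlate the residuals. The sketch below is a minimal illustration under that assumption; the listing does not say which control procedure the authors used, and the function name `partial_corr` and the synthetic variables `id_like` and `entropy_like` are hypothetical.

```python
import numpy as np

def partial_corr(x, y, control):
    """Pearson correlation of x and y after linearly regressing `control` out of both."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: an intercept column plus the control variable.
    design = np.column_stack([np.ones(len(x)), np.asarray(control, dtype=float)])
    # Residuals of x and y after a least-squares fit on the control.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Synthetic data (hypothetical, not the paper's): two scores that both grow
# with length but are otherwise independent of each other.
rng = np.random.default_rng(0)
length = rng.uniform(100, 1000, size=2000)
id_like = 0.01 * length + rng.normal(size=2000)
entropy_like = 0.01 * length + rng.normal(size=2000)

raw = float(np.corrcoef(id_like, entropy_like)[0, 1])
partial = partial_corr(id_like, entropy_like, length)
print(f"raw correlation: {raw:.3f}, partial correlation: {partial:.3f}")
```

In this toy setup the raw correlation is driven entirely by the shared dependence on length, so it largely disappears once length is regressed out. A linear control is only one option; a nonlinear length effect would need a richer regression.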
Abstract
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction…
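To make the notion of intrinsic dimension concrete, here is a minimal sketch of one standard estimator, TwoNN, which uses the ratio of each point's second and first nearest-neighbor distances. This is a swapped-in illustration: the abstract as shown here does not name the estimator the authors use, and the function name `twonn_id` and the synthetic point cloud are assumptions. In a text setting, the points would be token embeddings of a single document.

```python
import numpy as np

def twonn_id(points):
    """Maximum-likelihood TwoNN estimate of intrinsic dimension for a point cloud."""
    X = np.asarray(points, dtype=float)
    # Pairwise squared distances via the Gram matrix (avoids an N x N x D array).
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from rounding
    np.fill_diagonal(d2, np.inf)      # exclude each point's distance to itself
    # Two smallest squared distances per row, in ascending order.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1_sq, r2_sq = nearest[:, 0], nearest[:, 1]
    valid = r1_sq > 0                 # skip exact duplicates
    # log(r2 / r1) = 0.5 * log(r2^2 / r1^2)
    log_mu = 0.5 * np.log(r2_sq[valid] / r1_sq[valid])
    # Under the TwoNN model the ratios are Pareto distributed with shape d,
    # so the maximum-likelihood estimate is N / sum(log(mu)).
    return float(valid.sum() / log_mu.sum())

# Synthetic check (hypothetical data): a 3-dimensional Gaussian cloud
# linearly embedded in a 10-dimensional ambient space.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))
embedding = rng.normal(size=(3, 10))
cloud = latent @ embedding
estimate = twonn_id(cloud)
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

The estimate tracks the dimension of the latent cloud rather than the 10-dimensional ambient space, which is the sense in which ID measures geometric complexity independently of embedding width.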
Taxonomy
Topics: Authorship Attribution and Profiling · Mental Health via Writing · Text Readability and Simplification
