Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

Vladislav Pedashenko; Laida Kushnareva; Yana Khassan Nibal; Eduard Tulchinskii; Kristian Kuznetsov; Vladislav Zharchinskii; Yury Maximov; Irina Piontkovskaya

arXiv:2511.15210·cs.CL·November 20, 2025

TL;DR

This study grounds the intrinsic dimension (ID) of texts in interpretable properties: it relates ID to genre and linguistic features and shows which text properties raise or lower the geometric complexity that language models capture.

Contribution

The paper presents the first comprehensive analysis that grounds intrinsic dimension in interpretable text properties, covering its relation to entropy-based metrics, genre stratification, and causal linguistic features, using cross-encoder analysis and sparse autoencoders (SAEs).

Findings

01 ID is uncorrelated with entropy-based metrics after controlling for text length.

02 Scientific texts have low ID, while creative writing has high ID, across all models tested.

03 Formal signals decrease ID and humanized signals increase it; both effects are confirmed as causal.
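Finding 01 depends on comparing two metrics once text length has been controlled for. The listing does not say how the authors do this, so the following is only a minimal sketch of one common approach, a partial correlation computed by regressing length out of both variables. The function names `residualize` and `partial_corr` and all of the data are hypothetical; the data are synthetic and are not results from the paper.

```python
import numpy as np

def residualize(y, x):
    """Residuals of y after a least-squares linear fit on x (with intercept)."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def partial_corr(a, b, control):
    """Pearson correlation of a and b after removing the linear effect of control."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    control = np.asarray(control, dtype=float)
    ra = residualize(a, control)
    rb = residualize(b, control)
    return float(np.corrcoef(ra, rb)[0, 1])

# Synthetic illustration (assumption): both metrics depend on length,
# but share no other source of variation.
rng = np.random.default_rng(0)
length = rng.uniform(100, 1000, size=2000)
id_est = 0.01 * length + rng.normal(size=2000)
entropy = -0.005 * length + rng.normal(size=2000)

raw_r = float(np.corrcoef(id_est, entropy)[0, 1])   # driven by shared length dependence
partial_r = partial_corr(id_est, entropy, length)   # close to zero once length is removed
```

In this toy setup the raw correlation is strongly negative only because both quantities vary with length; the partial correlation removes that shared dependence. A linear control is the simplest choice; a nonlinear dependence on length would need a different model.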

Abstract

Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction…
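To make the notion of intrinsic dimension concrete, here is a minimal sketch of one nearest-neighbor estimator, TwoNN (Facco et al.), in its maximum-likelihood form. The truncated abstract above does not state which estimator the authors use, so this is a swapped-in technique for illustration and not the paper's method. The function name `twonn_id` and the synthetic point cloud are assumptions; in practice the points would be embeddings produced by a language model.

```python
import numpy as np

def twonn_id(points):
    """TwoNN intrinsic dimension estimate (maximum-likelihood form).

    For each point, take the ratio mu = r2 / r1 of the distances to its
    second and first nearest neighbors. Under the TwoNN assumptions mu is
    Pareto-distributed with shape d, which gives d = N / sum(log(mu)).
    """
    X = np.asarray(points, dtype=float)
    sq_norms = np.sum(X * X, axis=1)
    # Squared pairwise distances via the Gram matrix; clip tiny negatives.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)
    np.fill_diagonal(d2, np.inf)            # exclude self-distances
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1_sq, r2_sq = nearest[:, 0], nearest[:, 1]
    valid = r1_sq > 0                        # drop exact duplicates
    log_mu = 0.5 * np.log(r2_sq[valid] / r1_sq[valid])
    return float(valid.sum() / log_mu.sum())

# Synthetic check (assumption): a 3-dimensional Gaussian cloud embedded
# isometrically in a 10-dimensional ambient space.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1500, 3))
basis, _ = np.linalg.qr(rng.normal(size=(10, 3)))   # 10x3, orthonormal columns
cloud = latent @ basis.T                             # shape (1500, 10)
estimate = twonn_id(cloud)
```

The estimate should land near 3 even though the points live in 10 dimensions, which is the sense in which ID measures geometric complexity rather than the size of the embedding space.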

Taxonomy

Topics: Authorship Attribution and Profiling · Mental Health via Writing · Text Readability and Simplification