Emergent Hierarchical Structure in Large Language Models: An Information-Theoretic Framework for Multi-Scale Representation
Yukin Zhang, Qi Dong, Kemu Xu

TL;DR
This paper introduces an information-theoretic framework and argues that architecture, not scale, primarily determines where hierarchical processing boundaries fall in large language models; all eight models studied develop Local, Intermediate, and Global layer segments.
Contribution
It formalizes the Multi-Scale Probabilistic Generation Theory (MSPGT) to explain how architecture shapes multi-scale representations in Transformer models.
Findings
Models develop discrete hierarchical boundaries influenced mainly by architecture.
Boundary positions are stable across size variations in Llama but vary in Qwen.
Local-segment brittleness varies over three orders of magnitude across architectures.
Abstract
Why do language models from different architecture families respond so differently to the same perturbation? We argue that the answer is not scale, but how architecture shapes information compression. Analyzing eight Transformer models (7B to 70B parameters) from the Llama and Qwen families, we show that every model spontaneously develops discrete functional boundaries dividing its layers into Local, Intermediate, and Global processing segments, yet boundary locations and per-segment brittleness are determined overwhelmingly by architecture family rather than model size or training configuration. We formalize this regularity as the Multi-Scale Probabilistic Generation Theory (MSPGT), which models an autoregressive Transformer as a Hierarchical Variational Information Bottleneck system and derives a tiered set of falsifiable predictions. Three predictions are strongly…
