A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning

Jason Gaitonde, Frederic Koehler, Elchanan Mossel, Joonhyung Shin, and Allan Sly

arXiv:2605.13687·cs.LG·May 14, 2026

TL;DR

This paper introduces a family of hierarchical synthetic languages for analyzing how context length and reasoning affect autoregressive generation. It shows that generation with sublinear context deviates from the true language, whereas reasoning models with logarithmic memory can sample from it exactly.

Contribution

The authors develop an exact $k$-gram ansatz for analyzing hierarchical language models, derive explicit asymptotic predictions from it, and show that reasoning yields exponential improvements when context is limited.

Findings

01 Variance of generated sum scales log-linearly with context depth

02 Bounded-context autoregression often produces invalid sequences

03 Reasoning models with logarithmic memory can sample exactly from the true language
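
The bounded-context autoregression in the findings above can be pictured with a small count-based $k$-gram sampler. This is only an illustrative sketch: the paper's ansatz uses the exact conditional law given the last $k$ tokens, whereas this stand-in estimates it from sample sequences, and the names `fit_kgram` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_kgram(sequences, k):
    """Count next-token frequencies given the previous (up to) k tokens.

    Assumption: an empirical stand-in for an exact k-gram model.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq)):
            context = tuple(seq[max(0, i - k):i])
            counts[context][seq[i]] += 1
    return counts


def generate(counts, k, length, rng):
    """Autoregressively sample tokens, conditioning only on the last k."""
    out = []
    for _ in range(length):
        context = tuple(out[-k:]) if k > 0 else ()
        options = counts.get(context)
        if not options:
            break  # context never seen in the training sequences
        tokens = list(options)
        weights = [options[t] for t in tokens]
        out.append(rng.choices(tokens, weights=weights)[0])
    return out
```

Because the sampler forgets everything older than $k$ tokens, any constraint that spans a longer range is invisible to it, which is the kind of failure the findings describe for bounded context.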

Abstract

We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an exact $k$-gram ansatz in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the Ising broadcast process (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the coloring broadcast process (a hard-constrained…
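
To make the Ising broadcast process concrete, here is a minimal simulation sketch. It assumes a complete binary tree, a uniform root spin, and a single flip probability per edge; the paper's exact parametrization is not given in this excerpt, and the function names are hypothetical. The leaves, read left to right, play the role of the generated sequence, and the "generated sum" is the sum of the leaf spins.

```python
import random


def ising_broadcast_leaves(depth, flip_prob, rng):
    """Sample the leaf spins of a complete binary tree of the given depth.

    Assumptions: branching factor 2, uniform +/-1 root, and each child
    copies its parent's spin, flipping it with probability flip_prob.
    """
    level = [rng.choice((-1, 1))]
    for _ in range(depth):
        next_level = []
        for spin in level:
            for _ in range(2):
                child = -spin if rng.random() < flip_prob else spin
                next_level.append(child)
        level = next_level
    return level


def leaf_sum_variance(depth, flip_prob, trials, seed=0):
    """Monte Carlo estimate of the variance of the sum of leaf spins."""
    rng = random.Random(seed)
    sums = [sum(ising_broadcast_leaves(depth, flip_prob, rng))
            for _ in range(trials)]
    mean = sum(sums) / trials
    return sum((s - mean) ** 2 for s in sums) / trials
```

Comparing `leaf_sum_variance` for the true process against the same statistic for sequences produced by a bounded-context sampler is one way to look for the variance gap the abstract describes.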
