Multiple Token Divergence: Measuring and Steering In-Context Computation Density

Vincent Herrmann; Eric Alcaide; Michael Wand; Jürgen Schmidhuber

arXiv:2512.22944·cs.LG·December 30, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Multiple Token Divergence (MTD), a lightweight metric for measuring and controlling the computational effort of language models during in-context reasoning, correlating with task difficulty and accuracy.

Contribution

The paper proposes MTD as a novel, non-invasive measure of in-context computation effort and introduces Divergence Steering to control generated text complexity.

Findings

01 · MTD effectively distinguishes complex from simple tasks

02 · MTD correlates positively with problem difficulty on reasoning benchmarks

03 · Lower MTD is associated with more accurate reasoning

Abstract

Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model's full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem…
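The definition in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the KL direction (full model relative to shallow head) is an assumption because the abstract does not state it, and each position is treated as a single next-token distribution, whereas the paper's heads may cover multiple future tokens.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def token_divergence(full_logits, shallow_logits):
    """Divergence at one position.

    Assumption: KL(full || shallow). The abstract only says "KL divergence
    between" the two distributions, so the direction is a guess.
    """
    return kl_divergence(softmax(full_logits), softmax(shallow_logits))


def mean_divergence(full_seq, shallow_seq):
    """Average the per-position divergence over a sequence of logit vectors."""
    values = [token_divergence(f, s) for f, s in zip(full_seq, shallow_seq)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary at two positions (made-up numbers).
    full = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]]
    shallow = [[1.8, 0.6, 0.2], [1.0, 1.0, 1.0]]
    print(mean_divergence(full, shallow))
```

In this toy setup the second position contributes most of the divergence, because the shallow head is uniform there while the full model is confident. That is the kind of gap the metric is described as measuring.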

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is clear and well-motivated: the paper addresses limitations of PHi with a simpler, non-invasive method.
2. The evaluation is fairly comprehensive, covering both pre-trained models as well as those trained from scratch. While the model may require some adaptation (if it doesn't have an MTP head), it's less invasive than PHi.
3. MTD shows good correlation with task complexity and difficulty (e.g., on MATH). Furthermore, it's effective for CoT rationale selection when combined with NL

Weaknesses

1. The paper relies heavily on informal notions such as "complexity" and "interesting" versus "boring" tasks, instead of formal definitions.
2. The authors focus on only one "interesting" task in Section 3.1. It would be useful to see whether these results generalize to other "interesting" tasks.
3. Section 4.2 is missing a direct comparison to PHi for pre-trained models. Looking at the PHi paper, they report lower correlation with reasoning difficulty, but this could be due to

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper's most fascinating result (Sec. 4.2) is the decoupling of computational effort (MTD) from predictive plausibility (NLL). The finding that MTD correlates positively with MATH problem difficulty while NLL correlates negatively is a significant contribution.
- The discovery that MTD and NLL are anti-correlated with respect to problem difficulty is very interesting. It provides a new, orthogonal axis for analyzing model behavior. NLL measures "plausibility" or "surprise," while MTD measur

Weaknesses

- The paper's discussion briefly notes that MTD may "entangle genuine computational effort with memorization". This is a problem because the shallow MTP head is likely trained to be very good at predicting common, high-frequency (that is, memorized) n-grams. A high MTD might simply signal that the full model is generating a novel or rare sequence (for example, a specific fact or a unique turn of phrase) that the shallow head couldn't possibly predict, which is not necessarily the same as in-context

Reviewer 03 · Rating 6 · Confidence 3

Strengths

**Simplicity and Practicality:** The primary strength of MTD is its simplicity compared to prior methods like PHi. It avoids invasive architectural changes, complex loss functions, and unstable training. The ability to compute it post-hoc on models already equipped with MTP heads (like MiMo-7B) makes it a highly practical tool for analysis.

**Novel Decoding Method:** Divergence Steering is a genuinely new mechanism for controlling generation. It introduces a steering parameter, α, that is conce

Weaknesses

**Limited Impact of Steering on Reasoning:** The most significant weakness is the acknowledged failure of Divergence Steering to improve performance on the core reasoning tasks analyzed. While the method shows interesting, task-dependent effects on toy creative problems, its inability to enhance mathematical reasoning in pre-trained models severely limits the practical impact of the paper's second main contribution.

**Ambiguous Interpretation of the MTD Signal:** The paper finds that lower MTD

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications