How Do LLMs Use Their Depth?
Akshat Gupta, Jay Yeung, Gopala Anumanchipalli, Anna Ivanova

TL;DR
This paper investigates how large language models utilize their depth during inference, revealing a structured process where early layers generate high-frequency guesses and later layers refine predictions, with implications for efficiency improvements.
Contribution
It introduces the 'Guess-then-Refine' framework and provides detailed layer-wise analysis of LLM prediction dynamics through multiple case studies.
Findings
Early layers produce high-frequency token guesses.
Deeper layers refine guesses into contextually appropriate tokens.
Function words are predicted earlier than content words.
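The first two findings can be made concrete with a small measurement: for each layer, count how often the top-ranked prediction falls among the most frequent tokens of a reference corpus. The following is a minimal sketch under stated assumptions, not the authors' code. The function names, the cutoff `k`, and all toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter

def top_k_frequent(corpus_tokens, k):
    """Return the set of the k most frequent tokens in a reference corpus."""
    return {tok for tok, _ in Counter(corpus_tokens).most_common(k)}

def high_freq_share_by_layer(top1_by_layer, frequent):
    """For each layer, the fraction of top-1 predictions that are high-frequency tokens."""
    return [
        sum(tok in frequent for tok in preds) / len(preds)
        for preds in top1_by_layer
    ]

# Toy reference corpus (made up for illustration, not data from the paper).
corpus = "the cat sat on the mat and the dog sat on the rug".split()
frequent = top_k_frequent(corpus, k=3)  # counts: the=4, sat=2, on=2, rest=1

# Hypothetical top-1 predictions at four positions, one list per layer
# (early layer first). In practice these would come from decoding
# intermediate representations of a real model.
top1_by_layer = [
    ["the", "the", "on", "the"],   # early layer
    ["the", "cat", "on", "sat"],   # middle layer
    ["dog", "sat", "on", "rug"],   # final layer
]

shares = high_freq_share_by_layer(top1_by_layer, frequent)
print(shares)
```

With this toy input the share of high-frequency top-1 predictions falls as depth increases, which is the shape of the pattern the findings describe. The numbers themselves carry no empirical weight.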
Abstract
Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model due to the lack of contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. We then examine the dynamic usage of layer depth through three case studies. (i)…
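To illustrate what tracing intermediate representations can look like, the sketch below decodes a hidden state at every layer into a top-1 token and then finds the earliest layer from which that token matches the final prediction. It rests on several assumptions: it projects hidden states directly through an unembedding matrix (a logit-lens-style decode, which is a simplification and not the trained probe the reviews mention), the definition of `prediction_depth` is one plausible choice rather than the paper's, and all vectors are toy values.

```python
def decode_top1(hidden, unembed):
    """Project a hidden vector onto each vocabulary row; return the argmax token id."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return max(range(len(logits)), key=logits.__getitem__)

def prediction_depth(top1_per_layer):
    """Earliest layer index from which the top-1 token equals the final
    layer's top-1 token and does not change again (assumed definition)."""
    final = top1_per_layer[-1]
    depth = len(top1_per_layer) - 1
    # Walk backwards from the last layer while the prediction still matches.
    for layer in range(len(top1_per_layer) - 1, -1, -1):
        if top1_per_layer[layer] != final:
            break
        depth = layer
    return depth

# Toy unembedding matrix: 3 vocabulary items, hidden size 2 (made up).
unembed = [
    [1.0, 0.0],    # token 0
    [0.0, 1.0],    # token 1
    [-1.0, -1.0],  # token 2
]

# Hypothetical hidden states for one position, one vector per layer.
hidden_per_layer = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.8],
    [0.1, 1.2],
]

top1 = [decode_top1(h, unembed) for h in hidden_per_layer]
depth = prediction_depth(top1)
print(top1, depth)
```

Here the early layers pick token 0 and the later layers settle on token 1, so the prediction stabilizes at layer index 2. Requiring the match to persist through the final layer avoids counting a layer whose prediction agrees briefly and then changes.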
Peer Reviews
Decision · Submitted to ICLR 2026
* The authors ask an interesting question: how do LLMs use their many layers?
There is no novelty or contribution here. The authors simply report a few observations that stem from a single existing method (Tuned Lens). In my experience working with Tuned Lens and similar methods, the observations described here are taken for granted and are probably documented, at least implicitly, in every paper that uses these tools. Moreover, I think the basic claims made by the authors have also been reported elsewhere, with additional mechanistic detail and insight. For example:
* The paper proposes a simple, direct “guess-then-refine” view of depth with lightweight metrics. It provides an analysis of Tuned Lens results.
* Experiments cover multiple model families and tasks with transparent data.
* The findings might have practical uses beyond interpretability (e.g. early exit and routing in LLM systems).
* The Tuned Lens probe is trained to mimic the final distribution, so agreement with the final layer is not independent evidence and may itself imprint the reported pattern.
- If the layer embeddings are matched toward those of the last layer, the probe would naturally generate related tokens and patterns, but that would come from the affine mapping, not from the model.
- I think this is the most critical issue.
- The authors can try adding a probe that does not target the final distribution (or potentially comb
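The concern above about a probe trained to mimic the final distribution can be illustrated with a toy objective. The sketch below is loosely modeled on a lens that applies a per-layer affine translator before the unembedding and is scored by the KL divergence from the final-layer distribution. It is an illustration under assumptions, not the paper's setup: layer normalization is omitted, the translator weights are set by hand rather than trained, and all vectors are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def affine(hidden, weight, bias):
    """Affine translator: weight @ hidden + bias."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

def lens_distribution(hidden, weight, bias, unembed):
    """Translate a hidden state, then unembed and normalize it."""
    translated = affine(hidden, weight, bias)
    logits = [sum(t * u for t, u in zip(translated, row)) for row in unembed]
    return softmax(logits)

# Toy setup (all values made up): 2 tokens, hidden size 2.
unembed = [[1.0, 0.0], [0.0, 1.0]]
h_early = [1.0, 0.0]
h_final = [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

final_dist = lens_distribution(h_final, identity, zero_bias, unembed)

# Identity translator: decode the early state as it is.
loss_identity = kl_divergence(
    final_dist, lens_distribution(h_early, identity, zero_bias, unembed))

# Hand-picked translator that maps h_early onto h_final, standing in for
# what training against the final distribution could learn.
fitted = [[0.0, 0.0], [1.0, 0.0]]
loss_fitted = kl_divergence(
    final_dist, lens_distribution(h_early, fitted, zero_bias, unembed))

print(loss_identity, loss_fitted)
```

In this toy case the fitted translator reaches zero divergence from the final distribution even though the early state itself points at a different token. The example only shows why agreement with the final layer is not independent evidence. It does not show whether the effect is large enough to matter for the paper's results.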
- The three categories of tasks analysed not only show the breadth of the study but also help clarify the observations from the other categories.
- The claims made are very clear and well supported. No extravagant claims are present.
- The task difficulty vs layer prediction analysis is super nice.
- Any analysis of reasoning/CoT tasks would significantly improve the quality and scope of the manuscript.
- The manuscript would benefit from a more detailed discussion of the results: what implications they might have for practitioners, and any suggestions or algorithms, based on the observations in the manuscript, for improving performance (say, prediction depth) on hard tasks.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
