LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers
Jingze Zhu, Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yanqiang Zheng, Jiawei Chen, Xu Yang, Bernt Schiele, Jonas Fischer, Xinting Hu

TL;DR
LayerCake introduces a token-aware contrastive decoding method that aligns token types with specific transformer layers to enhance factual accuracy in large language models without additional training.
Contribution
It presents a novel layer-aware contrastive decoding approach that jointly considers token types and layer dynamics to improve factuality in LLM outputs.
Findings
Consistently improves factual accuracy across multiple LLMs.
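Punctuation tokens receive dominant attention in early layers and conceptual tokens govern semantic reasoning in intermediate layers; suppressing attention to each type at its respective depth yields the contrastive signal.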
No additional training or model modifications required.
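The layer-localized suppression described in these findings can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer ranges in `LAYER_RANGES`, the function names, the hard zeroing of attention columns, and the row renormalization are all hypothetical choices, since the summary only says that punctuation tokens matter in "early" layers and conceptual tokens in "intermediate" layers.

```python
import numpy as np

# Hypothetical layer ranges: the source only says "early" and "intermediate".
LAYER_RANGES = {"punctuation": range(0, 4), "conceptual": range(4, 16)}


def suppress_attention(attn, key_mask, scale=0.0):
    """Scale attention paid to masked key positions, then renormalize rows.

    attn:     (heads, queries, keys) array of attention weights, rows sum to 1.
    key_mask: boolean (keys,) array marking the key positions to suppress.
    scale:    multiplier for the masked columns (0.0 removes them entirely).
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(key_mask, dtype=bool)
    out = attn.copy()
    out[..., mask] *= scale
    row_sums = out.sum(axis=-1, keepdims=True)
    # Rows that lost all of their mass keep their original weights (assumption).
    has_mass = row_sums > 0
    return np.where(has_mass, out / np.where(has_mass, row_sums, 1.0), attn)


def intervene(attn, layer_idx, token_types):
    """Suppress a token type only in the layers assigned to that type."""
    for token_type, layers in LAYER_RANGES.items():
        if layer_idx in layers:
            mask = np.array([t == token_type for t in token_types])
            attn = suppress_attention(attn, mask)
    return attn


if __name__ == "__main__":
    attn = np.array([[[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]])
    types = ["punctuation", "conceptual", "other"]
    print(intervene(attn, layer_idx=0, token_types=types))
```

In a real model the perturbed attention would be applied inside the forward pass to produce a second, deliberately degraded next-token distribution; how token types are detected is left open here.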
Abstract
Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors, limiting their reliability in knowledge-intensive tasks. While decoding-time strategies provide a promising efficient solution without training, existing methods typically treat token-level and layer-level signals in isolation, overlooking the joint dynamics between them. In this work, we introduce a token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation. Through empirical attention analysis, we identify two key patterns: punctuation tokens receive dominant attention in early layers, while conceptual tokens govern semantic reasoning in intermediate layers. By selectively suppressing attention to these token types at their respective depths, we achieve…
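To make the contrastive step concrete, the sketch below combines the original next-token distribution with one obtained after an attention intervention. The abstract is truncated and does not give the scoring rule, so the formula here is the generic contrastive-decoding form, used as an assumption: `alpha` is taken to be the contrast strength and `beta` a plausibility threshold relative to the most likely token. The reviews quote only the value β = 0.1, not its role, and all function names are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def contrastive_scores(orig_logits, perturbed_logits, alpha=1.0, beta=0.1):
    """Score tokens by how much the intervention hurts them.

    Assumed form: (1 + alpha) * log p_orig - alpha * log p_perturbed,
    restricted to tokens with p_orig >= beta * max(p_orig).
    """
    log_p = log_softmax(orig_logits)
    log_q = log_softmax(perturbed_logits)
    # Plausibility filter: drop tokens the original model finds very unlikely.
    plausible = log_p >= np.log(beta) + log_p.max()
    scores = (1.0 + alpha) * log_p - alpha * log_q
    return np.where(plausible, scores, -np.inf)


if __name__ == "__main__":
    orig = [2.0, 1.9, -5.0]       # original model slightly prefers token 0
    perturbed = [2.0, 0.0, 3.0]   # the intervention hurts token 1 the most
    scores = contrastive_scores(orig, perturbed)
    print(int(np.argmax(scores)))
```

The plausibility filter keeps the contrast from promoting tokens that the unperturbed model already considers implausible, which is why token 2 is excluded even though the perturbed model favors it.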
Peer Reviews
Decision: Submitted to ICLR 2026
- Experimental results on models such as LLaMA, Mistral, and Qwen seem to show improved factual accuracy on several benchmarks.
- The paper introduces many hyperparameters in the method (α, th_a, th_b, β, and the layer ranges), but it is unclear how the authors chose them. The only clear statement is: "We set β = 0.1 throughout the paper." Otherwise, the authors say only that "thresholds th_a, th_b, and α are determined empirically", with no explanation or details. What does "determined empirically" mean? It's possible that the authors are adjusting the hyperparameters based on each individual test set perf
Constructing a contrastive signal by purposefully inducing erroneous predictions through targeted interventions is an elegant idea. The intervention design, which selectively suppresses attention to specific token types at their most influential layers, creates a meaningful contrastive distribution that exposes how factual reasoning emerges within the model. The link between token category (e.g. structural vs. conceptual) and layer range (early vs. mid vs. late) seems well-motivated, and the res
Most evaluations focus on short-form, question-answering style tasks, which raises questions about the generality of the approach. It remains unclear whether the approach would perform equally well for more open-ended forms of text generation such as long-form summarization, creative writing, or code synthesis. These tasks involve richer discourse structures, longer contexts, and a more complex interplay of coherence and factuality than typical QA settings. The specific intervention strategy: em
* The general approach makes sense, that it is possible to identify more precise sources of "problematic" LLM behaviors, and use those to perform contrastive decoding that is more informed and targeted towards these areas (rather than a generic "weak model" vs. "expert model" scenario).
* The specific method of using attention interventions on specific token categories is novel as far as I'm aware.
* The results are presented over a wide array of benchmarks for different types of tasks, and show
1. The method relies on quite a lot of parameters and heuristics - I was not entirely convinced by the motivation for those, and at the same time the paper is somewhat unclear regarding the methodology for choosing them in practice. Specifically, there is $th_a$ and $th_b$, and $\alpha$; the choice of layers representing the "early" and "middle" stages of processing; a separate attention modification logic for punctuation and conceptual tokens; and the decision to perform contrast with each of t
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
