Distilling to Hybrid Attention Models via KL-Guided Layer Selection
Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, Yoon Kim

TL;DR
This paper proposes an efficient method for selecting layers to convert in Transformer models, enabling effective distillation into hybrid attention architectures that improve inference efficiency without extensive pretraining.
Contribution
It introduces a simple layer importance scoring method for hybrid attention model distillation, outperforming existing heuristic and diagnostic-based approaches.
Findings
Layer importance scores guide effective layer conversion.
The method improves inference efficiency of LLMs.
Outperforms heuristic and diagnostic-based layer selection methods.
Abstract
Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected, we use a recent pipeline for the distillation process itself (RADLADS; Goldstein et al., 2025), which consists of attention weight transfer, hidden state alignment, and KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective…
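As a rough illustration of the selection step described in the abstract, the sketch below shows how a KL-based mismatch between teacher and student output distributions could be measured, and how a fixed budget of layers could then be picked from per-layer importance scores. The function names, the toy scores, and the convention that a higher score means the layer should remain softmax attention are all assumptions made for illustration; this summary does not specify how the paper derives its scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_token_kl(teacher_logits, student_logits):
    """Average KL(teacher || student) across token positions."""
    kls = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(kls) / len(kls)


def top_k_layers(importance, k):
    """Return the indices of the k highest-scoring layers, in layer order.

    `importance` is a hypothetical list of per-layer scores; higher is
    assumed to mean "more important to keep as softmax attention".
    """
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: 8 layers with a 25% softmax budget (k = 2). Scores are made up.
toy_importance = [0.02, 0.31, 0.05, 0.04, 0.27, 0.03, 0.10, 0.01]
keep_softmax = top_k_layers(toy_importance, k=2)
```

Here `keep_softmax` lists the layers that would stay as softmax attention, and all remaining layers would be converted to linear attention before running the distillation pipeline.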
Peer Reviews
Decision: ICLR 2026 Poster
* **Addresses a Practical Problem**. Tackles the challenge of converting pretrained softmax attention Transformers into more efficient hybrid architectures without expensive pretraining from scratch. Focuses on improving the inference efficiency of LLMs, which is a critical concern in practical deployments.
* **Intuitive Layer Selection Approach**. Proposes a KL-guided layer selection criterion that is both simple and theoretically motivated. The intuition is clear: layers that are more critical for maintain
* How does this method compare to a baseline that randomly selects the layers to replace with linear ones? That is, instead of Uniform or any of the more elaborate selection methods, if we simply chose K layers at random and linearized them, how would that affect performance?
* The number of layers to be linearized, K, seems to be heuristic or dataset-dependent, which makes the method feel somewhat brittle. Would this automatically transfer to some other task or would you need t
- The paper is very well structured and written, the sections flow naturally from one to another, and the motivation is very clear.
- The experiments are thorough, and the authors compare their method to various baselines and show that their method outperforms the others in settings where the vast majority (75% or more) of the layers are linear layers rather than softmax layers.
- The method is simple and effective, and does not require much computing (compared to the total pre-training cost) to
- On line 339, the authors state that their method is iterative; however, this is not entirely accurate. Their method calculates the importance of each softmax layer independently and then selects the top-K. This ignores how adding one softmax layer can affect the importance of other softmax layers, which an iterative process (add one, recalculate importance, add another, etc.) would take into account.
- It would be interesting and helpful for the paper to know which layers were chosen. Whet
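The distinction this review draws between independent scoring and an iterative procedure can be made concrete with a small sketch. It contrasts picking the top-K layers from scores computed once per layer against a greedy loop that re-scores the remaining candidates after each pick. The `toy_score` function and its numbers are invented purely for illustration (higher is assumed better) and are not taken from the paper.

```python
def independent_top_k(score_fn, num_layers, k):
    """Score each layer on its own, then take the k best (no re-scoring)."""
    scores = [score_fn(frozenset([i])) for i in range(num_layers)]
    ranked = sorted(range(num_layers), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def greedy_iterative(score_fn, num_layers, k):
    """Add one layer at a time, re-scoring candidates given earlier picks."""
    selected = set()
    for _ in range(k):
        candidates = [i for i in range(num_layers) if i not in selected]
        best = max(candidates, key=lambda i: score_fn(frozenset(selected | {i})))
        selected.add(best)
    return sorted(selected)


def toy_score(layers):
    """Hypothetical quality of keeping `layers` as softmax (higher is better).

    Layers 0 and 1 are made partly redundant: keeping both is worth much
    less than the sum of their individual gains.
    """
    gains = {0: 1.0, 1: 0.9, 2: 0.6}
    total = sum(gains[i] for i in layers)
    if 0 in layers and 1 in layers:
        total -= 0.8  # overlap penalty, made up for this example
    return total


independent_choice = independent_top_k(toy_score, num_layers=3, k=2)
greedy_choice = greedy_iterative(toy_score, num_layers=3, k=2)
```

In this toy setup the independent ranking keeps layers 0 and 1, while the greedy loop keeps layers 0 and 2 because it accounts for the overlap. The greedy variant needs many more score evaluations, which is the cost side of the trade-off.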
Despite its simplicity, the method achieves large gains, especially in the low-softmax regime (e.g., 12.5% ratio). The KL-based layer importance metric is conceptually elegant and empirically grounded.
All experiments focus on 3B-class decoder-only models; no results for encoder–decoder or smaller-scale models. While the paper emphasizes efficiency, explicit measurements of inference latency or memory savings are missing. The accuracy of layer importance relies heavily on the stability of Stage-2 KL distillation. If the teacher–student mismatch is large, the ranking may become unreliable.
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Advanced Neural Network Applications
