On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention
Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella

TL;DR
This paper introduces DSLA, a dual-state linear attention mechanism, and DSLA-Serve, an adaptive distillation framework that replaces Transformer layers with DSLA layers during inference, significantly improving efficiency while maintaining accuracy.
Contribution
The work presents a novel dual-state linear attention design and an online adaptive distillation method for transforming Transformers into more efficient models with minimal accuracy loss.
Findings
DSLA achieves better context preservation than traditional linear attention.
DSLA-Serve accelerates inference by over 2x compared to Llama2-7B.
The approach maintains performance across various NLP tasks.
Abstract
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy…
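The two-state idea in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the function name `dual_state_linear_attention`, the constant decay rates `decay_history` and `decay_recency`, and the fixed mixing weight `mix` are all assumptions chosen for illustration, and the summary above does not say how DSLA actually updates or combines its states. The sketch only shows how a recurrent linear-attention layer might keep one slowly decaying state for long-range context and one quickly decaying state for recent tokens.

```python
import numpy as np

def dual_state_linear_attention(q, k, v, decay_history=0.99,
                                decay_recency=0.5, mix=0.5):
    """Causal linear attention with two recurrent states (illustrative only).

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    The decay constants and the fixed mix are hypothetical choices,
    not values or mechanisms taken from the paper.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    s_hist = np.zeros((d_k, d_v))  # slow decay: retains older context
    s_rec = np.zeros((d_k, d_v))   # fast decay: tracks recent tokens
    out = np.zeros((T, d_v))
    for t in range(T):
        kv = np.outer(k[t], v[t])  # rank-1 update from the current token
        s_hist = decay_history * s_hist + kv
        s_rec = decay_recency * s_rec + kv
        # Read both states with the current query and blend the results.
        out[t] = q[t] @ (mix * s_hist + (1.0 - mix) * s_rec)
    return out
```

Each step costs O(d_k · d_v) regardless of sequence length, which is the source of the sub-quadratic behavior the abstract refers to. Setting both decay rates to 1.0 in this sketch recovers plain unnormalized causal linear attention.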
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Attention Is All You Need
