On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention

Yeonju Ro; Zhenyu Zhang; Souvik Kundu; Zhangyang Wang; Aditya Akella

arXiv:2506.09316 · cs.LG · June 18, 2025

TL;DR

This paper introduces DSLA, a dual-state linear attention mechanism, and DSLA-Serve, an adaptive distillation framework that replaces Transformer layers with DSLA layers during inference, significantly improving efficiency while maintaining accuracy.

Contribution

The work presents a novel dual-state linear attention design and an online adaptive distillation method for transforming Transformers into more efficient models with minimal accuracy loss.

Findings

01 DSLA achieves better context preservation than traditional linear attention.

02 DSLA-Serve accelerates inference by over 2x compared to Llama2-7B.

03 The approach maintains performance across various NLP tasks.

Abstract

Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy because they overemphasize recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy…
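The dual-state idea in the abstract can be illustrated with a small recurrence. The sketch below is not the paper's formulation: it assumes a plain decayed linear-attention update, gives the "history" state a slow decay and the "recency" state a fast decay, and blends the two readouts with a fixed weight. The function names, the decay constants, and the `mix` parameter are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. The decay constants, the fixed mixing weight, and
# every function name here are assumptions, not the DSLA paper's definitions.

def outer(k, v):
    """Outer product k v^T as a nested list of shape (d_k, d_v)."""
    return [[ki * vj for vj in v] for ki in k]


def decay_and_add(state, decay, update):
    """Return decay * state + update, elementwise."""
    return [[decay * s + u for s, u in zip(s_row, u_row)]
            for s_row, u_row in zip(state, update)]


def read(state, q):
    """Return q^T state, a vector of length d_v."""
    d_v = len(state[0])
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]


def dual_state_attention(queries, keys, values,
                         history_decay=0.99, recency_decay=0.5, mix=0.5):
    """Run a two-state linear-attention recurrence over a token sequence.

    Both states have fixed size (d_k x d_v), so memory does not grow with
    sequence length. `mix` weights the history readout; (1 - mix) weights
    the recency readout.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    history = [[0.0] * d_v for _ in range(d_k)]  # slow decay: long-range context
    recency = [[0.0] * d_v for _ in range(d_k)]  # fast decay: recent tokens
    outputs = []
    for q, k, v in zip(queries, keys, values):
        kv = outer(k, v)
        history = decay_and_add(history, history_decay, kv)
        recency = decay_and_add(recency, recency_decay, kv)
        h_out = read(history, q)
        r_out = read(recency, q)
        outputs.append([mix * h + (1.0 - mix) * r
                        for h, r in zip(h_out, r_out)])
    return outputs


# Toy usage: only the first token carries a value of 1.0.
ones = [[1.0], [1.0], [1.0]]
vals = [[1.0], [0.0], [0.0]]
print(dual_state_attention(ones, ones, vals, history_decay=1.0))
# prints [[1.0], [0.75], [0.625]]
```

In the toy run, the recency state alone would let the first token's contribution fall to 0.25 by the third step, while the history state (with decay 1.0 here) keeps it at 1.0; the blended output of 0.625 shows how a second, slowly decaying state can offset the short-range bias the abstract describes.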

Code & Models

Repositories

utnslab/dsla-serve (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Attention Is All You Need