Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, Jin-Long Li

TL;DR
This paper introduces Attention Editing, a practical framework for converting already-trained large language models to more efficient attention architectures such as MLA and SWA without re-pretraining from scratch.
Contribution
It proposes a novel attention conversion method using progressive distillation, enabling practical and robust adaptation of existing models to new attention mechanisms.
Findings
Models converted with Attention Editing maintain competitive performance.
Significant efficiency improvements are achieved after conversion.
Experiments demonstrate feasibility on large-scale hardware.
Abstract
Key-Value (KV) cache memory and bandwidth increasingly dominate large language model inference cost in long-context and long-generation regimes. Architectures such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) can alleviate this bottleneck, but integrating them into existing models remains difficult. Prior methods impose fine-grained structural requirements on both the source and target attention modules, which makes them impractical for real deployments. We present Attention Editing, a practical framework for converting already-trained large language models (LLMs) to new attention architectures without re-pretraining from scratch. Attention Editing replaces the original attention with a learnable target module and trains it using progressive distillation, consisting of (1) layer-wise teacher-forced optimization with intermediate activation…
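
To make the KV-cache pressure mentioned in the abstract concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the paper: the function names and the model configuration (32 layers, 8 KV heads, head dimension 128, fp16, a 32,768-token context, a 4,096-token window on 24 of the layers) are hypothetical values chosen only for illustration. It relies on two standard facts: a full-attention layer caches keys and values for every past token, while a sliding-window layer only needs the most recent `window` tokens.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, cached_tokens, bytes_per_elem=2):
    """KV cache size in bytes for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption, not from the paper).
    """
    return 2 * num_layers * num_kv_heads * head_dim * cached_tokens * bytes_per_elem


def hybrid_swa_cache_bytes(num_layers, num_swa_layers, window,
                           num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size when some layers use sliding-window attention.

    Full-attention layers cache all seq_len tokens; sliding-window layers
    cache at most `window` tokens.
    """
    full_layers = num_layers - num_swa_layers
    full_part = kv_cache_bytes(full_layers, num_kv_heads, head_dim,
                               seq_len, bytes_per_elem)
    swa_part = kv_cache_bytes(num_swa_layers, num_kv_heads, head_dim,
                              min(window, seq_len), bytes_per_elem)
    return full_part + swa_part


# Hypothetical configuration, for illustration only.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      cached_tokens=32768)
hybrid = hybrid_swa_cache_bytes(num_layers=32, num_swa_layers=24, window=4096,
                                num_kv_heads=8, head_dim=128, seq_len=32768)

GIB = 2 ** 30
print(f"full attention: {full / GIB:.3f} GiB per sequence")
print(f"hybrid SWA:     {hybrid / GIB:.3f} GiB per sequence")
print(f"ratio:          {hybrid / full:.3f}")
```

Under these assumed numbers the full-attention cache is 4 GiB per sequence and the hybrid layout needs 1.375 GiB, roughly a third as much. The gap widens as the context grows, because only the remaining full-attention layers scale with sequence length.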
