TL;DR
LightTransfer converts pretrained transformers such as LLaMA into hybrid models by replacing full attention with streaming attention in selected layers, reducing KV-cache memory and speeding up long-context generation with minimal performance loss.
Contribution
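The paper introduces LightTransfer, a lightweight method that converts transformer models into hybrid architectures for long-context processing. It needs no training for long-context understanding tasks and only minimal fine-tuning for o1-like long reasoning generation tasks.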
Findings
Achieves up to 2.17× throughput improvement with minimal performance loss
Transforms models such as LLaMA into hybrid variants without training for long-context understanding, and with minimal fine-tuning for long reasoning generation
Demonstrates strong results on diverse benchmarks, including LongBench and AIME24
Abstract
Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g.,…
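The lazy-layer idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the scoring rule (share of attention mass landing on the first few and most recent tokens) is an assumption made for illustration, and the names `streaming_mask`, `lazy_ratio`, `pick_lazy_layers`, `n_sink`, `n_recent`, and `n_replace` are hypothetical.

```python
import numpy as np

def streaming_mask(seq_len, n_sink, n_recent):
    """Boolean (seq_len, seq_len) mask: query q may see key k if k is causal
    and is either one of the first n_sink tokens or among the n_recent latest."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & ((k < n_sink) | (q - k < n_recent))

def lazy_ratio(attn, n_sink, n_recent):
    """Share of a layer's attention mass that falls inside the streaming mask.
    attn: (seq_len, seq_len) causal attention weights, each row summing to 1.
    Assumed scoring rule, not taken from the paper."""
    mask = streaming_mask(attn.shape[-1], n_sink, n_recent)
    return float((attn * mask).sum() / attn.sum())

def pick_lazy_layers(attn_per_layer, n_sink, n_recent, n_replace):
    """Return the indices of the n_replace layers with the highest lazy ratio."""
    scores = [lazy_ratio(a, n_sink, n_recent) for a in attn_per_layer]
    order = np.argsort(scores)[::-1]  # highest score first
    return sorted(int(i) for i in order[:n_replace])

# Toy example: one layer attends only to the current token (fully "lazy"),
# the other two spread attention uniformly over all earlier tokens.
T = 8
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]
local = np.eye(T)
print(pick_lazy_layers([uniform, local, uniform], n_sink=1, n_recent=2, n_replace=1))  # -> [1]
```

Under this sketch, a layer switched to streaming attention only needs to keep keys and values for `n_sink + n_recent` tokens, rather than for the whole context, which is where the KV-cache saving would come from.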
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Software-Defined Networks and 5G
Methods: Softmax · Attention Is All You Need
