IAM: Efficient Inference through Attention Mapping between Different-scale LLMs
Yi Zhao, Zuchao Li, Hai Zhao

TL;DR
This paper introduces IAM, a novel framework that leverages attention matrix similarity across different-scale LLMs to improve inference efficiency by accelerating attention computation and reducing KV cache usage.
Contribution
The paper presents a new attention mapping method between small and large LLMs, enhancing inference speed and cache efficiency without sacrificing performance.
Findings
Accelerates prefill by 15%
Reduces KV cache usage by 22.1%
Demonstrates generalizability across models
Abstract
LLMs encounter significant challenges in resource consumption nowadays, especially with long contexts. Despite extensive efforts dedicated to enhancing inference efficiency, these methods primarily exploit internal sparsity within the models, without leveraging external information for optimization. We identify the high similarity of attention matrices across different-scale LLMs, which offers a novel perspective for optimization. We first conduct a comprehensive analysis of how to measure similarity, how to select mapping layers, and whether mapping is consistent. Based on these insights, we introduce the IAM framework, which achieves dual benefits of accelerated attention computation and reduced KV cache usage by performing attention mapping between small and large LLMs. Our experimental results demonstrate that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1%…
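The idea of comparing attention matrices across models can be illustrated with a small sketch. The summary above does not say which similarity measure or mapping rule IAM uses, so everything below is an assumption made for illustration: cosine similarity over flattened causal attention matrices, a pick-the-most-similar rule, and toy score matrices in place of real model outputs. The helper names (`causal_attention`, `cosine_similarity`, `best_match`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(scores):
    """Row-wise softmax under a causal mask: token i attends to tokens 0..i."""
    out = []
    for i, row in enumerate(scores):
        probs = softmax(row[: i + 1])
        out.append(probs + [0.0] * (len(row) - i - 1))
    return out


def cosine_similarity(a, b):
    """Cosine similarity between two equally shaped matrices, flattened.

    Assumption: cosine similarity is only one possible measure; the paper's
    own metric is not given in this summary.
    """
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b)


def best_match(large_attn, small_attns):
    """Return (index, similarity) of the small-model matrix closest to large_attn."""
    sims = [cosine_similarity(large_attn, s) for s in small_attns]
    idx = max(range(len(sims)), key=sims.__getitem__)
    return idx, sims[idx]


# Toy pre-softmax scores for a 3-token sequence (made-up numbers).
large_scores = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [0.5, 1.0, 2.5]]
small_scores_a = [[1.0, 0.0, 0.0], [0.8, 2.9, 0.0], [0.4, 1.1, 2.4]]
small_scores_b = [[0.0, 0.0, 0.0], [3.0, 1.0, 0.0], [2.5, 1.0, 0.5]]

large_attn = causal_attention(large_scores)
small_attns = [causal_attention(small_scores_a), causal_attention(small_scores_b)]

idx, sim = best_match(large_attn, small_attns)
print(f"most similar small-model matrix: {idx} (cosine similarity {sim:.3f})")
```

In a setting like the one the abstract describes, a large-model layer whose attention closely matches a small-model layer could in principle reuse the small model's attention instead of recomputing it, which is where savings in attention computation and KV cache would come from. How IAM actually selects and applies the mapping is described in the paper, not in this sketch.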
Taxonomy
Topics: Neural Networks and Applications
