IAM: Efficient Inference through Attention Mapping between Different-scale LLMs

Yi Zhao; Zuchao Li; Hai Zhao

arXiv:2507.11953·cs.CL·July 17, 2025

Open Access

TL;DR

This paper introduces IAM, a novel framework that leverages attention matrix similarity across different-scale LLMs to improve inference efficiency by accelerating attention computation and reducing KV cache usage.

Contribution

The paper presents a new attention mapping method between small and large LLMs, enhancing inference speed and cache efficiency without sacrificing performance.

Findings

01 Accelerates prefill by 15%

02 Reduces KV cache usage by 22.1%

03 Demonstrates generalizability across models

Abstract

LLMs face significant resource-consumption challenges, especially with long contexts. Despite extensive efforts dedicated to enhancing inference efficiency, existing methods primarily exploit internal sparsity within the models, without leveraging external information for optimization. We identify the high similarity of attention matrices across different-scale LLMs, which offers a novel perspective for optimization. We first conduct a comprehensive analysis of how to measure similarity, how to select mapping layers, and whether mapping is consistent. Based on these insights, we introduce the IAM framework, which achieves the dual benefits of accelerated attention computation and reduced KV cache usage by performing attention mapping between small and large LLMs. Our experimental results demonstrate that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1%…
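
The abstract does not say which similarity measure the authors use, so the following is only an illustrative sketch of what comparing attention matrices from two different-scale models could look like. It assumes cosine similarity over the flattened matrices, and the query/key values for the "small" and "large" heads are made-up toy numbers, not data from the paper. All function names are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k):
    # Scaled dot-product attention weights with a causal mask:
    # token i attends only to tokens 0..i; masked positions are 0.
    d = len(q[0])
    n = len(q)
    attn = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        attn.append(softmax(scores) + [0.0] * (n - i - 1))
    return attn

def cosine_similarity(a, b):
    # Assumed metric: cosine similarity of the flattened matrices.
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b)

# Toy inputs (hypothetical): 3 tokens, head dim 2 for the small model
# and head dim 4 for the large one. The attention matrices are both
# 3x3, so they can be compared even though the head dims differ.
small_q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
small_k = [[1.0, 0.2], [0.3, 0.8], [0.1, 1.0]]
large_q = [[1.0, 0.0, 0.5, 0.0], [0.5, 0.5, 0.0, 0.5], [0.0, 1.0, 0.0, 0.5]]
large_k = [[1.0, 0.1, 0.4, 0.0], [0.2, 0.9, 0.0, 0.3], [0.0, 1.0, 0.1, 0.6]]

small_attn = causal_attention(small_q, small_k)
large_attn = causal_attention(large_q, large_k)
sim = cosine_similarity(small_attn, large_attn)
print(f"cosine similarity between attention matrices: {sim:.3f}")
```

Because attention weights depend only on sequence length, not on hidden size, matrices from models of different widths have the same shape for the same input, which is what makes a direct comparison of this kind possible.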

Taxonomy

Topics: Neural Networks and Applications