Clustering-driven Memory Compression for On-device Large Language Models
Ondrej Bohdal, Pramit Saha, Umberto Michieli, Mete Ozay, Taha Ceritli

TL;DR
This paper proposes a clustering-based memory compression method for on-device large language models that reduces the number of memory tokens while maintaining or improving personalization and generation quality.
Contribution
The paper introduces a clustering approach that merges similar memories before they are concatenated with the prompt, balancing context efficiency against personalization quality. The approach outperforms both naive averaging and plain concatenation.
Findings
Reduces memory tokens significantly
Outperforms baseline memory compression strategies
Enhances generation quality under a fixed context budget
Abstract
Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or…
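The cluster-then-merge idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each memory is a fixed-size array of t token embeddings of dimension d, uses plain k-means on the flattened memories as the similarity grouping, and averages within each cluster as the merge step. The function names `kmeans_labels` and `compress_memories` are hypothetical.

```python
import numpy as np


def kmeans_labels(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (fancy indexing copies).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def compress_memories(memories, k):
    """Group similar memories, average within groups, concatenate the results.

    memories: array of shape (n, t, d), an assumed representation in which
    each of the n memories is t token embeddings of dimension d.
    Returns an array of shape (m * t, d), where m <= k is the number of
    non-empty clusters.
    """
    memories = np.asarray(memories, dtype=float)
    n, t, d = memories.shape
    labels = kmeans_labels(memories.reshape(n, t * d), k)
    # Merge step: one averaged memory per non-empty cluster.
    merged = [
        memories[labels == j].mean(axis=0)
        for j in range(k)
        if np.any(labels == j)
    ]
    # Concatenate the merged memories along the token axis.
    return np.concatenate(merged, axis=0)
```

Under these assumptions, k = 1 reduces to naive averaging of all memories and k = n reduces to plain concatenation, so the number of clusters sets the trade-off between context length and how much heterogeneous content is averaged together.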
Taxonomy
Topics: Software System Performance and Reliability · Big Data and Digital Economy · Parallel Computing and Optimization Techniques
