Clustering-driven Memory Compression for On-device Large Language Models

Ondrej Bohdal; Pramit Saha; Umberto Michieli; Mete Ozay; Taha Ceritli

arXiv:2601.17443·cs.CL·January 27, 2026

Open Access

TL;DR

This paper proposes a clustering-based memory compression method for on-device large language models, effectively reducing memory size while maintaining or improving personalization and generation quality.

Contribution

The paper introduces a clustering approach that merges similar memories, balancing context efficiency against personalization quality and outperforming both naive averaging and concatenation.

Findings

01. Reduces memory tokens significantly

02. Outperforms baseline memory compression strategies

03. Enhances generation quality with fixed context budget

Abstract

Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or…
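
The cluster-then-merge idea in the abstract can be illustrated with a minimal sketch. It rests on assumptions not stated in the source: memories are taken to be fixed-size embedding vectors, similarity is Euclidean distance, clustering is a plain k-means loop, and merging is a per-cluster mean. The function name `cluster_and_merge` and its parameters are hypothetical, and the paper's actual clustering and merging procedure may differ.

```python
import numpy as np


def cluster_and_merge(memories, k, n_iter=20, seed=0):
    """Group memory vectors with k-means and average each group.

    Assumption: `memories` is an (n, d) array of memory embeddings.
    Returns (merged, labels): one merged vector per non-empty cluster,
    and the cluster index assigned to each input memory.
    """
    memories = np.asarray(memories, dtype=float)
    n = memories.shape[0]
    k = min(k, n)  # cannot have more clusters than memories
    rng = np.random.default_rng(seed)

    # Initialise centroids from k distinct memories.
    centroids = memories[rng.choice(n, size=k, replace=False)].copy()

    def assign(points, centers):
        # Squared Euclidean distance from every point to every centroid.
        diffs = points[:, None, :] - centers[None, :, :]
        return (diffs ** 2).sum(axis=2).argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(memories, centroids)
        for j in range(k):
            members = memories[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)

    labels = assign(memories, centroids)
    # Drop empty clusters and renumber labels as 0..k'-1.
    used, labels = np.unique(labels, return_inverse=True)
    merged = np.stack([memories[labels == j].mean(axis=0)
                       for j in range(len(used))])
    return merged, labels


if __name__ == "__main__":
    # Four toy 2-D "memories" forming two obvious groups.
    mem = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
    merged, labels = cluster_and_merge(mem, k=2)
    print(merged.shape, labels)
```

In this sketch the two baselines named in the abstract fall out as the extreme settings of `k`: `k=1` reduces to averaging all memories into one vector, while setting `k` to the number of (distinct) memories leaves them unmerged, which corresponds to plain concatenation. Intermediate values trade the number of vectors placed in the context against how many dissimilar memories are averaged together.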

Taxonomy

Topics: Software System Performance and Reliability · Big Data and Digital Economy · Parallel Computing and Optimization Techniques