CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation
Jingyu Li, Zhaocheng Du, Qianhui Zhu, Kaiyuan Li, Zhicheng Zhang, Song-Li Wu, Chaolang Li, Pengwen Dai

TL;DR
CollectiveKV introduces a method to share and compress key-value caches across users in sequential recommendation systems, significantly reducing storage needs while maintaining or improving performance.
Contribution
The paper proposes a novel cross-user KV sharing mechanism that leverages shared information, reducing storage overhead in KV caches for sequential recommendation models.
Findings
KV sequences across users have significant similarities.
The KV cache can be compressed to as little as 0.8% of its original size.
Method maintains or improves recommendation performance.
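To make the 0.8% figure concrete, the following back-of-the-envelope sketch estimates raw KV cache storage for a hypothetical deployment. Only the 0.008 ratio comes from the text above; the user count, sequence length, layer and head counts, head dimension, and 2-byte (fp16) values are illustrative assumptions, not numbers from the paper.

```python
def kv_cache_bytes(num_users, seq_len, num_layers, num_heads, head_dim,
                   bytes_per_value=2):
    """Raw size of a per-user KV cache: one key and one value vector per
    position, layer, and head. All arguments are hypothetical parameters."""
    keys_and_values = 2
    return (keys_and_values * num_users * seq_len * num_layers
            * num_heads * head_dim * bytes_per_value)


# Ratio reported on this page (0.8% of the original size).
COMPRESSION_RATIO = 0.008

# ASSUMPTION: an illustrative deployment, not taken from the paper.
full_bytes = kv_cache_bytes(num_users=10_000_000, seq_len=1_000,
                            num_layers=2, num_heads=4, head_dim=64)
compressed_bytes = full_bytes * COMPRESSION_RATIO

print(f"uncompressed: {full_bytes / 1e12:.2f} TB")
print(f"at 0.8%:      {compressed_bytes / 1e9:.2f} GB")
```

Under these assumed parameters, roughly 20.48 TB of cache would shrink to about 164 GB.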
Abstract
Sequential recommendation models are widely used in applications, yet they face stringent latency requirements. Mainstream models leverage the Transformer attention mechanism to improve performance, but its computational complexity grows with the sequence length, leading to a latency challenge for long sequences. Consequently, KV cache technology has recently been explored in sequential recommendation systems to reduce inference latency. However, KV cache introduces substantial storage overhead in sequential recommendation systems, which often have a large user base with potentially very long user history sequences. In this work, we observe that KV sequences across different users exhibit significant similarities, indicating the existence of collaborative signals in KV. Furthermore, we analyze the KV using singular value decomposition (SVD) and find that the information in KV can be…
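The two analyses mentioned in the abstract, cross-user similarity and SVD of the KV, might be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' procedure: the tensor shapes, the position-aligned cosine similarity, the noise scale, and the helper names `mean_cosine` and `cumulative_energy` are all assumptions introduced here. The synthetic keys are deliberately built from a shared component plus small per-user noise, so the output illustrates the computation rather than reproducing any result from the paper.

```python
import numpy as np


def mean_cosine(a, b):
    """Mean cosine similarity between position-aligned rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))


def cumulative_energy(matrix):
    """Cumulative share of squared singular values, largest first."""
    s = np.linalg.svd(matrix, compute_uv=False)
    energy = s ** 2
    return np.cumsum(energy) / energy.sum()


# ASSUMPTION: synthetic stand-in for cached keys, not real model output.
rng = np.random.default_rng(0)
n_users, seq_len, head_dim = 8, 32, 16
shared = rng.normal(size=(seq_len, head_dim))
keys = shared + 0.1 * rng.normal(size=(n_users, seq_len, head_dim))

# Similarity between two users' key sequences.
similarity = mean_cosine(keys[0], keys[1])

# Stack every user's rows into one (n_users * seq_len, head_dim) matrix
# and inspect how quickly the singular-value energy accumulates.
energy = cumulative_energy(keys.reshape(-1, head_dim))

print(f"mean cosine similarity (user 0 vs 1): {similarity:.3f}")
print("cumulative energy by rank:", np.round(energy, 3))
```

With real cached keys or values in place of `keys`, the same two functions would show how similar users' sequences are and how many singular directions carry most of the energy.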
Peer Reviews
Decision: ICLR 2026 Poster
- The motivation is clear. Through data-driven analysis, the existence of collaborative signals in KV caches is demonstrated from the perspectives of similarity and information decomposition, thus leading to the core idea of "decoupling and sharing".
- The paper pioneers the incorporation of cross-user collaborative signals into KV cache compression, distinguishing it from traditional methods focused on individual sequences in LLMs. By decoupling KV information into shared and user-specific comp
- The paper notes that the selection of the user-specific KV dimension lacks an automated strategy (as discussed in Section 6) and currently relies on manual setting (e.g., dimensions within 4% of the attention head dimension). This could necessitate iterative experimentation across different datasets or models.
- As shown in Figure 4, increasing the user-specific KV dimension can degrade performance, but the paper does not deeply explore the redundancy threshold.
- Ablation experiments reve
- CollectiveKV improves both the efficiency of the KV cache and the recommendation accuracy, without a trade-off between the two.
- Practically, it addresses the urgent latency problem of Transformer-based sequential recommendation models through low-rank factorization.
- The massive reduction in storage overhead is verified on real-world datasets in Table 1.
- The proposed method relies heavily on user similarity, so it might not be effective for applications with non-overlapping user interests.
- According to Figure 5(a), GAUC increases by about 0.005 as the global KV pool size grows to 10000. This gain is not negligible compared with the improvements reported in Table 1. How, then, is the conclusion in the last paragraph of page 8 derived: …a relatively small global KV pool is SUFFICIENT t
1. Groundbreaking Compression Rate and Performance Preservation: The method achieves an exceptional level of compression, reducing the KV cache size to as low as 0.8% of its original size (a compression ratio of 0.008). Crucially, this compression is achieved while simultaneously maintaining or even enhancing model performance across various target-attention and self-attention models (SIM, SDIM, ETA, TWIN, HSTU) and three long-sequence datasets.
2. Significant Inference Latency Reduction: Colle
1. Empirical Determination of User-Specific KV Dimensionality: A primary limitation acknowledged by the authors is the difficulty in determining the optimal dimensionality ($d_u$) for the low-dimensional user-specific KV. The choice of this dimension critically affects both the overall compression rate and the model's ability to capture sufficient personalized signals; if set incorrectly, performance may degrade or compression benefits may be lost.
2. Lack of Adaptive Strategy for $d_u$: Relate
Taxonomy
Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
