Long-Sequence Recommendation Models Need Decoupled Embeddings
Ningya Feng, Junwei Pan, Jialong Wu, Baixu Chen, Ximei Wang, Qian Li, Xian Hu, Jie Jiang, Mingsheng Long

TL;DR
This paper identifies a key limitation in long-sequence recommendation models related to shared embeddings for attention and representation, and proposes a decoupled embedding approach that improves accuracy and efficiency.
Contribution
The paper introduces DARE, a novel model with separate embeddings for attention and representation, addressing a critical deficiency in existing long-sequence recommendation systems.
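To make the idea concrete, here is a minimal sketch of a two-stage forward pass with decoupled embeddings: one table is used only to score and search behaviors against the target item, and a second table is used only to build the representation. This is an illustration under stated assumptions, not the authors' implementation. The function names, the dot-product scoring, the softmax-weighted sum, the table sizes, and the choice of a smaller attention dimension are all hypothetical choices made for this example.

```python
import math
import random


def make_table(num_items, dim, seed):
    """Randomly initialised embedding table (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_items)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decoupled_forward(history, target, attn_table, repr_table, top_k):
    # Stage 1 (search): score every behavior against the target using the
    # attention embeddings only, then keep the top_k highest-scoring positions.
    query = attn_table[target]
    scores = [dot(attn_table[item], query) for item in history]
    keep = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Stage 2 (aggregate): normalise the retrieved scores and take a weighted
    # sum of the *representation* embeddings of the retrieved behaviors.
    weights = softmax([scores[i] for i in keep])
    dim = len(repr_table[0])
    pooled = [0.0] * dim
    for w, i in zip(weights, keep):
        vec = repr_table[history[i]]
        for d in range(dim):
            pooled[d] += w * vec[d]

    # The pooled vector and the target's representation embedding would be
    # fed to a downstream predictor (omitted here).
    return pooled, repr_table[target], weights


# Assumed sizes: the attention table is deliberately narrower than the
# representation table, since the two no longer have to share a dimension.
NUM_ITEMS, ATTN_DIM, REPR_DIM = 50, 4, 16
attn_table = make_table(NUM_ITEMS, ATTN_DIM, seed=0)
repr_table = make_table(NUM_ITEMS, REPR_DIM, seed=1)

history = [3, 17, 8, 42, 8, 25, 11, 30]  # item ids of past behaviors
pooled, target_vec, weights = decoupled_forward(
    history, target=8, attn_table=attn_table, repr_table=repr_table, top_k=3
)
```

Because the search stage touches only `attn_table`, its cost scales with the attention dimension alone, which is why a narrower attention embedding can make retrieval cheaper without shrinking the representation.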
Findings
DARE outperforms baselines with up to 0.9% AUC improvement on public datasets.
Decoupling embeddings reduces attention embedding dimension and speeds up search by 50%.
Extensive experiments validate the effectiveness of decoupled embeddings in recommendation accuracy and efficiency.
Abstract
Lifelong user behavior sequences are crucial for capturing user interests and predicting user responses in modern recommendation systems. A two-stage paradigm is typically adopted to handle these long sequences: a subset of relevant behaviors is first searched from the original long sequences via an attention mechanism in the first stage and then aggregated with the target item to construct a discriminative representation for prediction in the second stage. In this work, we identify and characterize, for the first time, a neglected deficiency in existing long-sequence recommendation models: a single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. Initial attempts to address this issue with some common methods (e.g., linear projections -- a technique borrowed from language processing) proved ineffective,…
Peer Reviews
Decision: ICLR 2025 Poster
1. Preliminary study is available. Authors provide experimental evidence supporting that using a shared embedding for attention calculation and preference prediction can impair performance due to the varying magnitudes and conflicting gradients. Furthermore, the authors examine the linear projection solutions of current methods, noting that the constrained embedding size may hinder the effectiveness of linear projection for decoupling. 2. Clarity. The paper presents its ideas with remarkable c
1. Applicability. This work aims to decouple attention and representation embeddings for sequence modeling. I wonder if the proposed solution in Section 2 can be extended to more scenarios. When the attention mechanism is not used in certain sequential models, the suggested method does not apply and, therefore, cannot benefit these models. 2. Inconsistent Solution without Novelty. The proposed solution is too straightforward: setting separate embedding tables for attention and representation. Th
The definition of the problem is very clear. The research perspective is quite innovative.
The definition and explanation of the gradient issue are one-sided, and the motivation and the logic of the method are not clear. For example, why must the impact of different modules on embedding gradients be ‘equal-magnitude’ and ‘consistent’? How do imbalanced-magnitude and inconsistent gradients degrade recommendation performance? Additionally, the connection between the proposed method and the target problem is weak. For example, why does the embedding decoupling approach help mitigate th
1. The paper clearly identifies a crucial problem in long-sequence recommendation models. The analysis of the interference between attention and representation learning due to shared embeddings is comprehensive. 2. The extensive experiments on public datasets (Taobao and Tmall) and an online advertising platform are a major strength. The comparison with a wide range of baselines, including state-of-the-art models like TWIN and its variants, demonstrates the superiority of the DARE model. The ev
1. The performance on the short-sequence modeling using the Amazon dataset was marginal. This suggests that the DARE model may not be as effective in short-sequence scenarios, and more research is needed to understand its applicability in different sequence lengths. Additionally, the experimental setup for some baselines, like DIN, required adjustments to fit the long-sequence context, which may introduce some biases in the comparison. 2. There are some results that lack a satisfactory explanat
Taxonomy
Topics: Machine Learning in Healthcare
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
