Long-Sequence Recommendation Models Need Decoupled Embeddings

Ningya Feng; Junwei Pan; Jialong Wu; Baixu Chen; Ximei Wang; Qian Li; Xian Hu; Jie Jiang; Mingsheng Long

arXiv:2410.02604·cs.IR·March 27, 2025

TL;DR

This paper identifies a key limitation in long-sequence recommendation models related to shared embeddings for attention and representation, and proposes a decoupled embedding approach that improves accuracy and efficiency.

Contribution

The paper introduces DARE, a novel model with separate embeddings for attention and representation, addressing a critical deficiency in existing long-sequence recommendation systems.
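The decoupling idea can be illustrated with a minimal sketch. This is not the authors' implementation (see the official repository for that); the dot-product scoring, the softmax normalization, the table sizes, and the names `decoupled_pool`, `attn_table`, and `repr_table` are all assumptions made for illustration. The only point it shows is that attention weights are read from one embedding table and the aggregated vector from another.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_table(num_items, dim, rng):
    """Random stand-in for a learned embedding table (hypothetical)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_items)]


def decoupled_pool(history, target, attn_table, repr_table):
    """Weights come only from attn_table; the pooled vector only from repr_table."""
    query = attn_table[target]
    weights = softmax([dot(attn_table[item], query) for item in history])
    dim = len(repr_table[0])
    pooled = [
        sum(w * repr_table[item][d] for w, item in zip(weights, history))
        for d in range(dim)
    ]
    return weights, pooled


rng = random.Random(0)
NUM_ITEMS, ATTN_DIM, REPR_DIM = 30, 4, 8  # assumed sizes; attention table is smaller
attn_table = make_table(NUM_ITEMS, ATTN_DIM, rng)
repr_table = make_table(NUM_ITEMS, REPR_DIM, rng)
history = [rng.randrange(NUM_ITEMS) for _ in range(12)]

weights, pooled = decoupled_pool(history, 3, attn_table, repr_table)

# Swapping the representation table changes the pooled vector but not the weights.
other_repr = make_table(NUM_ITEMS, REPR_DIM, rng)
weights_again, pooled_again = decoupled_pool(history, 3, attn_table, other_repr)
```

Because the two tables are independent, the attention table can use a different (here smaller) dimension than the representation table, which is the property the efficiency finding relies on.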

Findings

1. DARE outperforms baselines with up to 0.9% AUC improvement on public datasets.

2. Decoupling embeddings allows the attention embedding dimension to be reduced, which speeds up the search stage by 50% without significant performance impact.

3. Extensive experiments validate the effectiveness of decoupled embeddings in recommendation accuracy and efficiency.

Abstract

Lifelong user behavior sequences are crucial for capturing user interests and predicting user responses in modern recommendation systems. A two-stage paradigm is typically adopted to handle these long sequences: a subset of relevant behaviors is first searched from the original long sequences via an attention mechanism in the first stage and then aggregated with the target item to construct a discriminative representation for prediction in the second stage. In this work, we identify and characterize, for the first time, a neglected deficiency in existing long-sequence recommendation models: a single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. Initial attempts to address this issue with some common methods (e.g., linear projections -- a technique borrowed from language processing) proved ineffective,…
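The two-stage paradigm described in the abstract can be sketched roughly as follows. The abstract only says that an attention mechanism is used, so the dot-product relevance score, the top-k cut, the softmax pooling, and every name here (`two_stage`, `table`, `k`) are assumptions for illustration. Note that this sketch uses a single shared table for both scoring and pooling, which is the setup the paper argues causes interference.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def two_stage(history, target, table, k):
    """Stage 1: search the k behaviors most relevant to the target.
    Stage 2: attention-pool the retrieved behaviors into one vector."""
    query = table[target]
    scores = [dot(table[item], query) for item in history]

    # Stage 1: positions of the k highest-scoring behaviors.
    top_idx = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)[:k]

    # Stage 2: softmax over the retrieved scores, then a weighted sum of embeddings.
    weights = softmax([scores[i] for i in top_idx])
    dim = len(query)
    pooled = [
        sum(w * table[history[i]][d] for w, i in zip(weights, top_idx))
        for d in range(dim)
    ]
    return top_idx, weights, pooled


random.seed(0)
NUM_ITEMS, DIM, K = 50, 8, 20  # assumed toy sizes
table = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_ITEMS)]
history = [random.randrange(NUM_ITEMS) for _ in range(200)]

top_idx, weights, pooled = two_stage(history, target=7, table=table, k=K)
```

In a real model the pooled vector would be combined with the target item and other features before the prediction layer; that part is omitted here.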

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. Preliminary study is available. Authors provide experimental evidence supporting that using a shared embedding for attention calculation and preference prediction can impair performance due to the varying magnitudes and conflicting gradients. Furthermore, the authors examine the linear projection solutions of current methods, noting that the constrained embedding size may hinder the effectiveness of linear projection for decoupling. 2. Clarity. The paper presents its ideas with remarkable c

Weaknesses

1. Applicability. This work aims to decouple attention and representation embeddings for sequence modeling. I wonder if the proposed solution in Section 2 can be extended to more scenarios. When the attention mechanism is not used in certain sequential models, the suggested method does not apply and, therefore, cannot benefit these models. 2. Inconsistent Solution without Novelty. The proposed solution is too straightforward: setting separate embedding tables for attention and representation. Th

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The definition of the problem is very clear. The research perspective is quite innovative.

Weaknesses

The definition and explanation of the gradient issue are one-sided, and the motivation and the logic of the method are not clear. For example, why must the impact of different modules on embedding gradients be ‘equal-magnitude’ and ‘consistent’? How do imbalanced-magnitude and inconsistent gradients degrade recommendation performance? Additionally, the connection between the proposed method and the target problem is weak. For example, why does the embedding decoupling approach help mitigate th

Reviewer 03 · Rating 3 · Confidence 3

Strengths

1. The paper clearly identifies a crucial problem in long-sequence recommendation models. The analysis of the interference between attention and representation learning due to shared embeddings is comprehensive. 2. The extensive experiments on public datasets (Taobao and Tmall) and an online advertising platform are a major strength. The comparison with a wide range of baselines, including state-of-the-art models like TWIN and its variants, demonstrates the superiority of the DARE model. The ev

Weaknesses

1. The performance on the short-sequence modeling using the Amazon dataset was marginal. This suggests that the DARE model may not be as effective in short-sequence scenarios, and more research is needed to understand its applicability in different sequence lengths. Additionally, the experimental setup for some baselines, like DIN, required adjustments to fit the long-sequence context, which may introduce some biases in the comparison. 2. There are some results that lack a satisfactory explanat

Code & Models

Repositories

thuml/dare
PyTorch · Official

Videos

Long-Sequence Recommendation Models Need Decoupled Embeddings · SlidesLive

Taxonomy

Topics: Machine Learning in Healthcare

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training