On the Embedding Collapse when Scaling up Recommendation Models
Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, Mingsheng Long

TL;DR
This paper investigates the embedding collapse phenomenon in large recommendation models, analyzing its causes and effects, and proposes a multi-embedding design with interaction modules to improve scalability and reduce collapse.
Contribution
It identifies embedding collapse as a key scalability issue and introduces a multi-embedding approach with interaction modules to mitigate it in recommendation models.
Findings
Embedding collapse restricts embedding learning and scalability.
Interaction modules help mitigate embedding collapse.
Proposed design improves scalability and reduces collapse across models.
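To make the multi-embedding idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that each embedding set has its own lookup tables and its own interaction parameters, and that the per-set outputs are averaged before any later layers. The bilinear interaction, the averaging step, and all names and sizes (`vocab_sizes`, `num_sets`, `multi_embedding_forward`) are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
vocab_sizes = [10, 20, 5]   # one categorical field per entry
num_sets = 4                # number of independent embedding sets
dim = 8                     # embedding size within each set

# One lookup table per (embedding set, field).
tables = [
    [rng.standard_normal((v, dim)) * 0.1 for v in vocab_sizes]
    for _ in range(num_sets)
]
# One interaction weight matrix per embedding set (assumed bilinear form).
interaction_weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(num_sets)]


def bilinear_interaction(field_vectors, weight):
    """Return e_i^T W e_j for every field pair i < j."""
    stacked = np.stack(field_vectors)            # (num_fields, dim)
    scores = stacked @ weight @ stacked.T        # (num_fields, num_fields)
    upper = np.triu_indices(len(field_vectors), k=1)
    return scores[upper]                         # (num_pairs,)


def multi_embedding_forward(feature_ids):
    """Look up each set, interact within the set, then average across sets."""
    per_set = []
    for set_tables, weight in zip(tables, interaction_weights):
        vectors = [table[i] for table, i in zip(set_tables, feature_ids)]
        per_set.append(bilinear_interaction(vectors, weight))
    # Averaging across sets is an assumption of this sketch.
    return np.mean(per_set, axis=0)


hidden = multi_embedding_forward([3, 7, 1])
```

The point of the sketch is that scaling happens by adding embedding sets rather than by widening a single table, so each set can learn its own interaction pattern.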
Abstract
Recent advances in foundation models have led to a promising trend of developing large recommendation models to leverage vast amounts of available data. Still, mainstream models remain embarrassingly small in size and naïve enlarging does not lead to sufficient performance gain, suggesting a deficiency in the model scalability. In this paper, we identify the embedding collapse phenomenon as the inhibition of scalability, wherein the embedding matrix tends to occupy a low-dimensional subspace. Through empirical and theoretical analysis, we demonstrate a two-sided effect of feature interaction specific to recommendation models. On the one hand, interacting with collapsed embeddings restricts embedding learning and exacerbates the collapse issue. On the other hand, interaction is crucial in mitigating the fitting of spurious features as a scalability guarantee. Based on our…
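The collapse described in the abstract can be checked numerically from the singular values of an embedding matrix. The sketch below assumes the paper's information abundance metric is the sum of singular values divided by the largest one, which is our reading of the definition and is not stated on this page. The function name and the synthetic matrices are hypothetical.

```python
import numpy as np


def information_abundance(embedding: np.ndarray) -> float:
    """Sum of singular values divided by the largest singular value.

    Values near the embedding dimension suggest a well-spread spectrum;
    values near 1 suggest the matrix lies close to a one-dimensional subspace.
    """
    sigma = np.linalg.svd(embedding, compute_uv=False)
    top = sigma.max()
    if top == 0.0:
        return 0.0  # all-zero matrix: no information by convention here
    return float(sigma.sum() / top)


rng = np.random.default_rng(0)
num_ids, dim = 1000, 16

# A matrix that uses all 16 directions roughly equally.
spread = rng.standard_normal((num_ids, dim))

# A synthetic "collapsed" matrix: 16 columns, but rank 2.
collapsed = rng.standard_normal((num_ids, 2)) @ rng.standard_normal((2, dim))

ia_spread = information_abundance(spread)
ia_collapsed = information_abundance(collapsed)
```

Under this definition the value is bounded above by the embedding dimension, so comparisons across different dimension sizes (a concern raised in the reviews) may call for normalizing by the dimension.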
Peer Reviews
Decision·ICML 2024 Poster
Originality: - The paper investigates the enlarged embedding layers of recommendation models and identifies a phenomenon of embedding collapse, wherein the embedding matrix tends to reside in a low-dimensional subspace. The discovery is novel as far as I know. - The paper proposed information abundance to measure the degree of collapse for embedding matrices. Quality: - The paper is well-written. It starts with a novel finding of embedding collapse when increasing embedding dimension which mig
- In section 3, the paper proposes Information Abundance to measure the degree of collapse of embedding matrices. As the paper focuses on the scaling law of embedding layers, the paper should discuss whether Information Abundance is a fair metric when comparing embedding matrices of different dimension sizes. - In section 4.2, the paper uses regularized DCNv2 as an example to show that suppressing feature interaction is insufficient for scalability. It is unclear to me why feature interaction in
S1. This paper provides empirical and theoretical analysis of the embedding collapse phenomenon. S2. This paper provides information abundance for quantifying the degree of collapse for such matrices with low-rank tendencies.
W1. The novelty of this paper seems to be limited. The method of dividing the single embedding into multi-embedding sets is similar to DMRL[1] for disentangled representation learning. DMRL divides the feature representation of each modality into k chunks. As a result, the features of different factors are entangled. W2. The motivation is not completely solid. The reason for increasing the embedding size of the model is inappropriate. W3. The experimental results of the paper are insufficient.
Originality: - 'Information Abundance' as a novel quantitative measure of embedding-layer collapse. - The 'Interaction-Collapse Law' and the two-sided effect of the feature interaction process help improve the understanding of embeddings' behavior in recommendation systems. Quality: - The authors explore embeddings and their behavior in detail, particularly in the context of information collapse, with rigorous visualizations. Significance: - Broad Implications for Recommend
1. Insufficient Empirical Validation on Large-Scale Data: The authors show with empirical evidence that large-scale recommendation models scale poorly. However, it is common knowledge that large-scale models are inherently data-hungry and need more data to converge well. Since this is an important premise that the paper relies on, it would be good if the authors could follow up with additional data points that prove or disprove it. The experiments seem to be on the same amount of training data on
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · Machine Learning in Healthcare
