Learning Large-scale Universal User Representation with Sparse Mixture of Experts
Caigao Jiang, Siqiao Xue, James Zhang, Lingyue Liu, Zhibo Zhu, Hongyan Hao

TL;DR
This paper introduces SUPERMOE, a scalable framework using sparse mixture of experts to learn universal user representations across multiple tasks, addressing challenges of high-dimensionality and the seesaw phenomenon.
Contribution
The paper proposes a novel MoE transformer-based framework with a new loss function for multi-task user embedding, enabling billion- to trillion-parameter models.
Findings
Achieves state-of-the-art performance on public datasets.
Effective in real-world online business scenarios.
Addresses the seesaw phenomenon (gains on one task coming at the expense of others) with a task indicator loss.
Abstract
Learning embeddings of user behaviour sequences is challenging because of complicated feature interactions over time and the high dimensionality of user features. Recently emerging foundation models, e.g., BERT and its variants, have encouraged a large body of researchers to investigate this field. However, unlike in natural language processing (NLP) tasks, the parameters of a user behaviour model come mostly from the user embedding layer, which makes most existing works fail to train a universal user embedding at large scale. Furthermore, user representations are learned from multiple downstream tasks, and past research does not address the seesaw phenomenon. In this paper, we propose SUPERMOE, a generic framework to obtain high-quality user representations from multiple tasks. Specifically, the user behaviour sequences are encoded by an MoE transformer, and we can thus increase the…
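The abstract says that behaviour sequences are encoded by an MoE transformer but does not give the routing details. The following is a minimal sketch of a generic top-k sparse mixture-of-experts feed-forward layer, written in NumPy under stated assumptions. It is not the SUPERMOE implementation: the function name `sparse_moe_layer`, the ReLU experts, the choice `top_k=2`, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_moe_layer(tokens, gate_w, expert_w1, expert_w2, top_k=2):
    """Generic top-k MoE feed-forward layer (illustrative, not from the paper).

    tokens:    (n_tokens, d_model)   one row per behaviour-sequence position
    gate_w:    (d_model, n_experts)  router weights
    expert_w1: (n_experts, d_model, d_hidden)
    expert_w2: (n_experts, d_hidden, d_model)
    """
    logits = tokens @ gate_w                                # (n_tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]       # chosen experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    weights = softmax(top_logits, axis=-1)                  # renormalise over chosen experts

    out = np.zeros_like(tokens)
    for e in range(gate_w.shape[1]):
        # Rows routed to expert e, and which top-k slot selected it.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert does no work for this batch
        hidden = np.maximum(tokens[token_ids] @ expert_w1[e], 0.0)  # ReLU (assumed)
        out[token_ids] += weights[token_ids, slot][:, None] * (hidden @ expert_w2[e])
    return out, top_idx

# Usage with random data; all sizes are arbitrary.
rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden, n_experts = 6, 8, 16, 4
tokens = rng.normal(size=(n_tokens, d_model))
gate_w = rng.normal(size=(d_model, n_experts))
expert_w1 = rng.normal(size=(n_experts, d_model, d_hidden))
expert_w2 = rng.normal(size=(n_experts, d_hidden, d_model))
out, routed = sparse_moe_layer(tokens, gate_w, expert_w1, expert_w2, top_k=2)
```

Each token is processed by only `top_k` of the experts, so adding experts grows the parameter count while the per-token compute stays roughly constant. This is the general property that lets sparse MoE models scale to very large parameter counts.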
Taxonomy
Topics: Recommender Systems and Techniques · Topic Modeling · Advanced Graph Neural Networks
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Layer Normalization · Linear Warmup With Linear Decay · Adam · Weight Decay · WordPiece · Softmax · Residual Connection
