Learning Large-scale Universal User Representation with Sparse Mixture of Experts

Caigao Jiang; Siqiao Xue; James Zhang; Lingyue Liu; Zhibo Zhu; Hongyan Hao

arXiv:2207.04648·cs.LG·July 12, 2022

TL;DR

This paper introduces SUPERMOE, a scalable framework that uses a sparse mixture of experts to learn universal user representations across multiple tasks, addressing the challenges of high-dimensional user features and the seesaw phenomenon.

Contribution

The paper proposes a novel MoE transformer-based framework with a new loss function for multi-task user embedding, enabling billion- to trillion-parameter models.

Findings

01. Achieves state-of-the-art performance on public datasets.

02. Effective in real-world online business scenarios.

03. Addresses the seesaw phenomenon with a task indicator loss.

Abstract

Learning embeddings of user behaviour sequences is challenging because of complicated feature interactions over time and the high dimensionality of user features. Emerging foundation models such as BERT and its variants have encouraged a large body of research in this field. However, unlike in natural language processing (NLP) tasks, the parameters of a user behaviour model come mostly from the user embedding layer, which makes most existing approaches fail to train a universal user embedding at large scale. Furthermore, user representations are learned from multiple downstream tasks, and past research does not address the seesaw phenomenon. In this paper, we propose SUPERMOE, a generic framework for obtaining high-quality user representations from multiple tasks. Specifically, the user behaviour sequences are encoded by an MoE transformer, and we can thus increase the…
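
To make the idea of a sparse mixture-of-experts layer concrete, the following is a minimal sketch of generic top-k expert routing for a single behaviour-token vector. It is not the SUPERMOE implementation: the class name `SparseMoE`, the linear experts, the random weights, and the sizes (`dim=4`, `num_experts=8`, `top_k=2`) are all assumptions chosen for illustration, and the abstract does not specify the paper's gating function or its task indicator loss.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class SparseMoE:
    """Toy top-k mixture-of-experts layer (hypothetical, for illustration)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.gate = rand_matrix(num_experts, dim)  # one gating row per expert
        # Each expert is a single linear map here; real experts are usually MLPs.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.gate, token)
        chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = chosen[: self.top_k]
        # Renormalise the gate over the selected experts only.
        weights = softmax([logits[i] for i in chosen])
        return list(zip(chosen, weights))

    def forward(self, token):
        """Weighted sum of the selected experts' outputs; other experts are skipped."""
        out = [0.0] * len(token)
        for idx, weight in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


layer = SparseMoE(dim=4, num_experts=8, top_k=2)
token = [0.5, -1.0, 0.25, 2.0]  # made-up embedding of one behaviour event
routing = layer.route(token)
output = layer.forward(token)
print(routing)
print(output)
```

Because only `top_k` of the `num_experts` experts run for each token, adding experts grows the parameter count while the per-token compute stays roughly constant, which is the general property that lets MoE models scale in capacity.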

Taxonomy

Topics: Recommender Systems and Techniques · Topic Modeling · Advanced Graph Neural Networks

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Layer Normalization · Linear Warmup With Linear Decay · Adam · Weight Decay · WordPiece · Softmax · Residual Connection