Learning Unified User Quantized Tokenizers for User Representation

Chuan He; Yang Chen; Wuliang Huang; Tianyi Zheng; Jianhu Chen; Bin Dou; Yice Luo; Yun Zhu; Baokun Wang; Yongchao Liu; Xing Fu; Yu Cheng; Chuntao Hong; Weiqiang Wang; Xin-Wei Yao; Zhongle Xie

arXiv:2508.00956·cs.LG·October 1, 2025

TL;DR

U2QT introduces a unified framework for user representation that combines cross-domain knowledge transfer with early fusion, utilizing quantized tokens for efficient storage and improved performance across multiple downstream tasks.

Contribution

The paper presents U2QT, a novel two-stage architecture that integrates cross-domain transfer and early fusion using quantized tokenizers, addressing scalability and generalization issues in user modeling.

Findings

01. Outperforms task-specific baselines in behavior prediction and recommendation.
02. Achieves significant storage and computation efficiency.
03. Supports seamless integration with language models for industrial applications.

Abstract

Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, we use the Qwen3 Embedding model to derive a compact yet expressive feature representation; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while…
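
The second stage can be illustrated with a minimal residual-quantization sketch. This is not the authors' implementation: it omits the encoder, decoder, and training losses of an RQ-VAE, and the way the shared and source-specific codebooks are combined here (shared codebook at the first level, a per-source codebook at the second) is an assumption made only for illustration. The names `nearest_code`, `residual_quantize`, `tokenize_user`, `source_a`, `source_b`, and all sizes are hypothetical.

```python
import numpy as np

def nearest_code(vec, codebook):
    """Return the index of the codebook row closest to vec (squared L2)."""
    dists = np.sum((codebook - vec) ** 2, axis=1)
    return int(np.argmin(dists))

def residual_quantize(embedding, codebooks):
    """Quantize level by level: each codebook encodes the residual left
    by the previous levels. Returns (token ids, reconstruction)."""
    residual = embedding.astype(np.float64).copy()
    reconstruction = np.zeros_like(residual)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        reconstruction += codebook[idx]
        residual -= codebook[idx]
    return tokens, reconstruction

# Hypothetical sizes and randomly initialized codebooks (a trained
# RQ-VAE would learn these; random values are used only for the demo).
rng = np.random.default_rng(0)
dim, codebook_size = 8, 16
shared_codebook = rng.normal(size=(codebook_size, dim))
source_codebooks = {
    "source_a": rng.normal(size=(codebook_size, dim)),
    "source_b": rng.normal(size=(codebook_size, dim)),
}

def tokenize_user(source_embeddings):
    """Map each source embedding to two tokens: one from the shared
    codebook (assumed level 1), one from its own codebook (level 2)."""
    user_tokens = {}
    for source, emb in source_embeddings.items():
        levels = [shared_codebook, source_codebooks[source]]
        tokens, _ = residual_quantize(emb, levels)
        user_tokens[source] = tokens
    return user_tokens

user = {name: rng.normal(size=dim) for name in source_codebooks}
user_tokens = tokenize_user(user)
print(user_tokens)
```

Under these assumptions, each source embedding of `dim` floats is stored as two small integers, which is the kind of compression that quantized tokens make possible; the actual token counts and codebook sizes used by U2QT are not stated in this summary.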

Taxonomy

Topics: Recommender Systems and Techniques · Advanced Graph Neural Networks · Topic Modeling