MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling
Bin Wu, Feifan Yang, Zhangming Chan, Yu-Ran Gu, Jiawei Feng, Chao Yi, Xiang-Rong Sheng, Han Zhu, Jian Xu, Mang Ye, Bo Zheng

TL;DR
MUSE is a multimodal search framework for lifelong user interest modeling that effectively combines simple and complex multimodal signals across different stages, improving recommendation performance with minimal latency.
Contribution
The paper introduces MUSE, a two-stage multimodal framework that pairs a deliberately simple retrieval stage (GSU) with a richer fine-grained modeling stage (ESU). The authors also share large-scale industrial deployment data and practices.
Findings
Lightweight cosine similarity with high-quality embeddings outperforms complex retrieval methods.
MUSE achieves significant improvements in top-line metrics in Taobao advertising.
Open-sourced dataset enables further research on long-sequence multimodal modeling.
Abstract
Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features, suffering from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage -- the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its…
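The GSU idea described in the abstract, scoring each past behavior by cosine similarity against the target item and keeping the best matches, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (cosine_similarity, gsu_top_k), the toy two-dimensional embeddings, and the choice of k are all assumptions made for illustration, and a production system would presumably use batched vector operations over precomputed embeddings rather than pure Python.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gsu_top_k(
    target: Sequence[float],
    behaviors: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Hypothetical GSU-style retrieval: return (index, score) pairs for the
    k behaviors whose embeddings are most similar to the target embedding."""
    scored = [(i, cosine_similarity(target, emb)) for i, emb in enumerate(behaviors)]
    # Highest similarity first; Python's sort is stable, so ties keep history order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings (assumed values, not from the paper's dataset).
    target_item = [1.0, 0.0]
    history = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
    for index, score in gsu_top_k(target_item, history, k=2):
        print(index, round(score, 4))
```

In this toy run the second and third behaviors are retrieved, since they point most nearly in the same direction as the target. The behaviors selected this way would then be passed to the ESU, whose richer sequence modeling and ID-multimodal fusion are not sketched here.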
Taxonomy
Topics: Recommender Systems and Techniques · Emotion and Mood Recognition · Mobile Crowdsensing and Crowdsourcing
