MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling

Bin Wu; Feifan Yang; Zhangming Chan; Yu-Ran Gu; Jiawei Feng; Chao Yi; Xiang-Rong Sheng; Han Zhu; Jian Xu; Mang Ye; Bo Zheng

arXiv:2512.07216·cs.IR·December 9, 2025

TL;DR

MUSE is a multimodal search framework for lifelong user interest modeling that effectively combines simple and complex multimodal signals across different stages, improving recommendation performance with minimal latency.

Contribution

The paper introduces MUSE, a novel two-stage multimodal framework that balances simplicity and richness in modeling, and shares large-scale industrial deployment data and practices.

Findings

01 Lightweight cosine similarity with high-quality embeddings outperforms complex retrieval methods.

02 MUSE achieves significant improvements in top-line metrics in Taobao advertising.

03 Open-sourced dataset enables further research on long-sequence multimodal modeling.

Abstract

Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features and therefore suffer from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage -- the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its…
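
To make the GSU insight concrete, the following is a minimal sketch of cosine-similarity retrieval over a user's behavior sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine_similarity` and `gsu_top_k`, the toy two-dimensional vectors, and the plain top-k selection are all hypothetical, and a production system would use precomputed multimodal embeddings and batched vector operations.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gsu_top_k(
    target: Sequence[float],
    behaviors: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Hypothetical GSU step: score every historical behavior embedding against
    the target item embedding and keep the k most similar ones.

    Returns (behavior_index, similarity) pairs, most similar first.
    """
    scored = [(i, cosine_similarity(target, emb)) for i, emb in enumerate(behaviors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy example (assumed data): four past behaviors, one target item.
target_item = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
retrieved = gsu_top_k(target_item, history, k=2)
print([index for index, _ in retrieved])
```

In this toy example the two retrieved behaviors are those at indices 0 and 2, the vectors pointing most nearly in the same direction as the target. In a two-stage setup, the retrieved subsequence would then be passed to the ESU for finer-grained modeling.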

Code & Models

Datasets

TaoBao-MM/Taobao-MM (dataset, 2.4k downloads)

Taxonomy

Topics: Recommender Systems and Techniques · Emotion and Mood Recognition · Mobile Crowdsensing and Crowdsourcing