QARM: Quantitative Alignment Multi-Modal Recommendation at Kuaishou

Xinchen Luo; Jiangxia Cao; Tianyu Sun; Jinkai Yu; Rui Huang; Wei Yuan; Hezheng Lin; Yichen Zheng; Shiyao Wang; Qigen Hu; Changqing Qiu; Jiaqi Zhang; Xu Zhang; Zhiheng Yan; Jingming Zhang; Simin Zhang; Mingxing Wen; Zhaojie Liu; Kun Gai; Guorui Zhou

arXiv:2411.11739·cs.IR·November 19, 2024

Open Access

TL;DR

This paper introduces QARM, a multi-modal recommendation framework that aligns and customizes multi-modal representations for improved user interest modeling in industry settings.

Contribution

It proposes a novel quantitative multi-modal framework that addresses representation unmatching and unlearning issues, enabling trainable and task-specific multi-modal representations.

Findings

01 Improved recommendation accuracy with aligned multi-modal representations.

02 Effective customization of multi-modal information for downstream tasks.

03 Addresses key limitations of pre-trained multi-modal models.

Abstract

In recent years, with the significant evolution of multi-modal large models, many recommender-system researchers have realized the potential of multi-modal information for user interest modeling. In industry, a widely used modeling architecture is a cascading paradigm: (1) first, a multi-modal model is pre-trained to provide omnipotent representations for downstream services; (2) the downstream recommendation model then takes the multi-modal representation as additional input to fit real user-item behaviours. Although this paradigm achieves remarkable improvements, two problems still limit model performance: (1) Representation Unmatching: the pre-trained multi-modal model is always supervised by classic NLP/CV tasks, while recommendation models are supervised by real user-item interactions. As a result, the goals of the two fundamentally different tasks are relatively separate,…
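As a rough illustration of the cascading paradigm described in the abstract, the following minimal Python sketch treats the multi-modal representations as cached, fixed inputs to a small downstream model. Everything in it is an assumption made for illustration and is not taken from the paper: the toy vectors in `frozen_mm`, the one-hot ID features, the logistic-regression recommender, and the three-row interaction log. It is not QARM itself, only the baseline setup the abstract criticizes.

```python
import math

# Stage (1), assumed: toy vectors standing in for embeddings that a pre-trained
# multi-modal model produced offline. They are cached and never trained here.
frozen_mm = {0: [0.9, 0.1], 1: [0.1, 0.9], 2: [0.8, 0.3]}
mm_snapshot = {item: list(vec) for item, vec in frozen_mm.items()}

NUM_ITEMS = len(frozen_mm)
MM_DIM = 2

# Stage (2): a hypothetical downstream recommender, here plain logistic
# regression over [one-hot item ID, frozen multi-modal vector].
# Only `weights` is trainable.
weights = [0.0] * (NUM_ITEMS + MM_DIM)

def features(item):
    """Concatenate the ID feature with the cached multi-modal vector."""
    one_hot = [1.0 if i == item else 0.0 for i in range(NUM_ITEMS)]
    return one_hot + frozen_mm[item]

def predict(item):
    """Predicted click probability for an item."""
    z = sum(w * x for w, x in zip(weights, features(item)))
    return 1.0 / (1.0 + math.exp(-z))

def train_step(item, label, lr=0.1):
    """One SGD step on log loss; gradients reach `weights` only."""
    x = features(item)
    err = predict(item) - label  # gradient of log loss w.r.t. the logit
    for i, xi in enumerate(x):
        weights[i] -= lr * err * xi

# Hypothetical interaction log: (item, clicked)
interactions = [(0, 1), (1, 0), (2, 1)]
for _ in range(200):
    for item, label in interactions:
        train_step(item, label)
```

After training, `frozen_mm` is still identical to `mm_snapshot`: the recommendation loss never updates the multi-modal vectors, and those vectors were never fitted to user-item behaviour in the first place. This seems to correspond to the gap between the two stages that the abstract calls Representation Unmatching, and possibly to the "unlearning" issue named in the Contribution section, although the truncated abstract does not spell that second issue out.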

Taxonomy

Topics: Natural Language Processing Techniques