A General Framework for Multimodal LLM-Based Multimedia Understanding in Large-Scale Recommendation Systems
Yiming Zhu, Xu Liu, Ziyun Xu, Zheng Wu, Joena Zhang, Sirius Chen, Chenheli Hua, Silvester Yao, Qichao Que, Wentao Shi, Junfeng Pan, Linhong Zhu

TL;DR
This paper introduces a generalized framework utilizing multimodal large language models to improve multimedia understanding in large-scale recommendation systems, addressing integration challenges and demonstrating measurable performance gains.
Contribution
It presents a tripartite architecture for MM-LLM integration into recommendation pipelines, instantiated with a LLaMA2-based model for content interpretation and feature extraction.
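As a rough illustration of the feature-extraction step: the abstract describes generated captions being ingested as tokenized categorical features. The sketch below shows one way that could look, assuming simple lowercase word tokenization and the hashing trick to map tokens to integer IDs. The function name, bucket count, and token cap are hypothetical; the source does not specify the paper's actual tokenizer or vocabulary.

```python
import hashlib
import re

# Hypothetical settings; the source does not state the real values.
NUM_BUCKETS = 100_000  # size of the hashed categorical ID space
MAX_TOKENS = 16        # cap on tokens kept per caption


def caption_to_feature_ids(caption, num_buckets=NUM_BUCKETS, max_tokens=MAX_TOKENS):
    """Map a caption to a list of categorical feature IDs (illustrative only)."""
    # Lowercase and split into alphanumeric word tokens, then truncate.
    tokens = re.findall(r"[a-z0-9]+", caption.lower())[:max_tokens]
    ids = []
    for tok in tokens:
        # Use a stable hash (MD5) so IDs are reproducible across processes,
        # unlike Python's built-in hash(), which is salted per run.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        ids.append(int.from_bytes(digest[:8], "big") % num_buckets)
    return ids


if __name__ == "__main__":
    # A caption such as an MM-LLM might produce for a short video.
    print(caption_to_feature_ids("A dog runs on the beach at sunset"))
```

The resulting ID list could then be fed to a sparse embedding table like any other categorical feature, which is one plausible reading of how caption text reaches a ranking model.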
Findings
Achieved a 0.35% increase in offline AUC.
Realized a 0.02% improvement in online metrics.
Validated the practical viability of MM-LLMs in industrial recommendation systems.
Abstract
Conventional recommendation systems frequently fail to fully exploit the high-dimensional semantic signals inherent in multimedia content, thereby limiting the fidelity of user preference modeling. While Multimodal Large Language Models (MM-LLMs) offer robust mechanisms for interpreting such complex data, their integration into latency-constrained, industrial-scale architectures remains a significant challenge. To address this, we propose a generalized framework for MM-LLM-driven multimedia understanding. Our methodology employs a tripartite architecture encompassing content interpretation, representation extraction, and systematic pipeline integration, instantiated via a LLaMA2-based model that generates descriptive captions subsequently ingested as tokenized categorical features. Empirical evaluation demonstrates the efficacy of this approach, yielding a 0.35% increase in offline…
