A General Framework for Multimodal LLM-Based Multimedia Understanding in Large-Scale Recommendation Systems

Yiming Zhu; Xu Liu; Ziyun Xu; Zheng Wu; Joena Zhang; Sirius Chen; Chenheli Hua; Silvester Yao; Qichao Que; Wentao Shi; Junfeng Pan; Linhong Zhu

arXiv:2605.09338 · cs.IR · May 12, 2026

TL;DR

This paper introduces a generalized framework utilizing multimodal large language models to improve multimedia understanding in large-scale recommendation systems, addressing integration challenges and demonstrating measurable performance gains.

Contribution

It presents a tripartite architecture for MM-LLM integration into recommendation pipelines, instantiated with a LLaMA2-based model for content interpretation and feature extraction.

Findings

01

Achieved a 0.35% increase in offline AUC.

02

Realized a 0.02% improvement in online metrics.

03

Validated the practical viability of MM-LLMs in industrial recommendation systems.

Abstract

Conventional recommendation systems frequently fail to fully exploit the high-dimensional semantic signals inherent in multimedia content, thereby limiting the fidelity of user preference modeling. While Multimodal Large Language Models (MM-LLMs) offer robust mechanisms for interpreting such complex data, their integration into latency-constrained, industrial-scale architectures remains a significant challenge. To address this, we propose a generalized framework for MM-LLM-driven multimedia understanding. Our methodology employs a tripartite architecture encompassing content interpretation, representation extraction, and systematic pipeline integration, instantiated via a LLaMA2-based model that generates descriptive captions subsequently ingested as tokenized categorical features. Empirical evaluation demonstrates the efficacy of this approach, yielding a 0.35% increase in offline…
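The abstract says captions are "ingested as tokenized categorical features" but this page gives no further detail. The following is a minimal sketch of one common way to do that, hashing caption tokens into a fixed ID space; it is an assumption for illustration, not the authors' implementation. The function name `caption_to_feature_ids`, the bucket count, the token cap, and the regex tokenizer are all hypothetical.

```python
import hashlib
import re

# Hypothetical settings; the paper's actual vocabulary size and
# sequence length are not stated on this page.
NUM_BUCKETS = 1 << 20
MAX_TOKENS = 16


def caption_to_feature_ids(caption, num_buckets=NUM_BUCKETS, max_tokens=MAX_TOKENS):
    """Map a caption string to a list of categorical feature IDs.

    Assumed scheme: lowercase the text, split it into alphanumeric
    tokens, keep the first `max_tokens`, and hash each token into
    one of `num_buckets` buckets.
    """
    tokens = re.findall(r"[a-z0-9]+", caption.lower())
    ids = []
    for tok in tokens[:max_tokens]:
        # md5 is used only as a stable hash that does not change
        # between processes (unlike Python's built-in hash()).
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        ids.append(int.from_bytes(digest[:8], "big") % num_buckets)
    return ids


if __name__ == "__main__":
    # The caption text here is made up for the example.
    print(caption_to_feature_ids("A dog runs on the beach"))
```

The resulting ID list could then feed an embedding lookup in a ranking model like any other sparse categorical feature. Hashing avoids maintaining an explicit vocabulary, at the cost of occasional collisions between unrelated tokens.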
