Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Marco De Nadai, Andreas Damianou, Mounia Lalmas

TL;DR
This paper proposes a zero-finetuning framework that prompts multimodal large language models to generate high-level semantic descriptions of videos, significantly improving personalized video recommendations by capturing semantics that low-level features miss.
Contribution
Introduces a simple, recommendation system-agnostic method that uses off-the-shelf MLLMs to inject high-level semantics into video recommendation pipelines without any fine-tuning.
Findings
MLLM-generated descriptions outperform traditional features on the MicroLens-100K dataset
The approach improves recommendation accuracy across multiple models
The results demonstrate the effectiveness of MLLMs as knowledge extractors
Abstract
Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system-agnostic zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. "a superhero…
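The pipeline the abstract outlines (prompt an off-the-shelf MLLM once per clip, then hand the resulting description to the recommender as an item feature) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `fake_mllm` stand-in, the clip file names, and the hashed bag-of-words `embed_text` encoder are all assumptions made for the example. In practice the MLLM call and the text encoder would be replaced by real models.

```python
import hashlib
import math
from typing import Callable, Dict, List

# Hypothetical prompt; the exact wording used in the paper is not given here.
PROMPT = (
    "Describe what you see in this video, including intent, humour, "
    "and any world knowledge needed to understand it."
)

# An MLLM is modelled as any callable: (clip_path, prompt) -> description.
Mllm = Callable[[str, str], str]


def describe_clips(clip_paths: List[str], mllm: Mllm) -> Dict[str, str]:
    """Ask the MLLM for one natural-language description per clip."""
    return {path: mllm(path, PROMPT) for path in clip_paths}


def embed_text(text: str, dim: int = 64) -> List[float]:
    """Stand-in text encoder (assumption): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_item_features(
    clip_paths: List[str], mllm: Mllm, dim: int = 64
) -> Dict[str, List[float]]:
    """Describe every clip, then encode each description as an item vector."""
    descriptions = describe_clips(clip_paths, mllm)
    return {path: embed_text(desc, dim) for path, desc in descriptions.items()}


def fake_mllm(clip_path: str, prompt: str) -> str:
    """Canned responses standing in for a real multimodal model (assumption)."""
    canned = {
        "clip_001.mp4": "a singer performing on a rooftop",
        "clip_002.mp4": (
            "an ironic parody filmed amid the fairy chimneys "
            "of Cappadocia, Turkey"
        ),
    }
    return canned.get(clip_path, "an unlabelled short video clip")


# Item features that any downstream recommender could consume.
features = build_item_features(["clip_001.mp4", "clip_002.mp4"], fake_mllm)
```

Because the descriptions are produced offline and consumed as ordinary item features, nothing in the recommender itself has to change, which is what makes the approach recommendation system-agnostic and free of fine-tuning.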
