Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Marco De Nadai; Andreas Damianou; Mounia Lalmas

arXiv:2508.09789·cs.IR·August 14, 2025


TL;DR

This paper proposes a zero-finetuning framework that prompts off-the-shelf multimodal large language models to generate high-level semantic descriptions of videos. These descriptions improve personalized video recommendations by capturing content semantics that low-level visual and acoustic features miss.

Contribution

Introduces a simple, recommendation system-agnostic method that uses off-the-shelf MLLMs to inject high-level semantics into video recommendation pipelines without any fine-tuning.

Findings

01 Outperforms traditional features on the MicroLens-100K dataset

02 Enhances recommendation accuracy across multiple models

03 Demonstrates the effectiveness of MLLMs as knowledge extractors

Abstract

Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system-agnostic zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. "a superhero…
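The describe-then-recommend idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prompt wording, the `mllm` callable, the `fake_mllm` stub, and the hashed bag-of-words `embed_text` function are all hypothetical stand-ins. The excerpt above also does not say how descriptions are consumed downstream, so turning each description into a feature vector for a recommender is an assumption.

```python
import hashlib
import math
from typing import Callable, Dict, List

# Hypothetical prompt wording; the paper's actual prompt is not shown in this excerpt.
PROMPT = (
    "Describe this video clip in a few sentences. Cover what is shown, "
    "the likely intent or humour, and any relevant world knowledge."
)


def describe_clip(clip_path: str, mllm: Callable[[str, str], str]) -> str:
    """Ask a caller-supplied MLLM wrapper (assumed interface) for a description."""
    return mllm(clip_path, PROMPT).strip()


def embed_text(text: str, dim: int = 64) -> List[float]:
    """Toy hashed bag-of-words embedding, standing in for a real text encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Stable hash so the same token always maps to the same bucket.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def build_item_features(
    clips: List[str], mllm: Callable[[str, str], str], dim: int = 64
) -> Dict[str, List[float]]:
    """Map each clip to a text-derived feature vector usable by any recommender."""
    return {clip: embed_text(describe_clip(clip, mllm), dim) for clip in clips}


def fake_mllm(clip_path: str, prompt: str) -> str:
    """Stub that returns canned text; replace with a real MLLM call."""
    canned = {
        "a.mp4": "A singer on a rooftop.",
        "b.mp4": "An ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey.",
    }
    return canned[clip_path]


features = build_item_features(["a.mp4", "b.mp4"], fake_mllm)
```

Because the output is just a per-item vector, it could in principle be supplied to a recommender in the same way as any other content feature, which is one reading of "recommendation system-agnostic".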


Code & Models

Datasets

marcodena/video-recs-describe-what-you-see
dataset· 63 dl
