ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation

Fatemeh Nazary; Ali Tourani; Yashar Deldjoo; Tommaso Di Noia

arXiv:2508.04206·cs.IR·August 7, 2025

TL;DR

ViLLA-MMBench is a reproducible, extensible benchmark suite for evaluating multimodal movie recommendation systems that use large language models to augment sparse metadata and that combine audio, visual, and text features through configurable fusion techniques.

Contribution

It introduces a unified, extensible benchmark with multimodal data, LLM-based metadata enrichment, and flexible fusion methods for improved movie recommendation evaluation.

Findings

01. LLM-based augmentation enhances cold-start performance.

02. Strong text embeddings improve coverage and diversity.

03. Fusion methods significantly impact recommendation accuracy.

Abstract

Recommending long-form video content demands joint modeling of visual, audio, and textual modalities, yet most benchmarks address only raw features or narrow fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K, it aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada), generating high-quality synopses for thousands of movies. All text (raw or augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5), producing multiple ready-to-use sets. The pipeline supports interchangeable early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation.…
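
To make the fusion terminology in the abstract concrete, here is a minimal sketch of two of the named strategies: early fusion by concatenation followed by PCA, and late fusion by rank aggregation. It is an illustration under stated assumptions, not the benchmark's own code. The function names are hypothetical, the random matrices stand in for real per-item audio, visual, and text embeddings, and a Borda count is assumed as the rank-aggregation rule because the abstract does not say which one is used.

```python
import numpy as np


def zscore(x):
    """Standardize each feature column to zero mean and unit variance."""
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero on constant columns
    return (x - x.mean(axis=0)) / std


def early_fusion_concat(modalities):
    """Early fusion: concatenate standardized item embeddings column-wise."""
    return np.hstack([zscore(m) for m in modalities])


def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def borda_late_fusion(score_lists):
    """Late fusion: aggregate per-model item scores with a Borda count.

    Each row of score_lists holds one model's scores for the same items.
    An item earns (n_items - 1 - rank) points per model, rank 0 being best.
    """
    scores = np.asarray(score_lists, dtype=float)
    n_items = scores.shape[1]
    # argsort of argsort yields each item's rank position (0 = highest score)
    ranks = np.argsort(np.argsort(-scores, axis=1), axis=1)
    return (n_items - 1 - ranks).sum(axis=0)


# Hypothetical stand-ins for per-item embeddings (6 items, arbitrary widths).
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 4))
visual = rng.normal(size=(6, 5))
text = rng.normal(size=(6, 8))

fused = early_fusion_concat([audio, visual, text])  # shape (6, 17)
compact = pca_reduce(fused, n_components=3)         # shape (6, 3)

# Two models score the same three items; higher points mean a better fused rank.
points = borda_late_fusion([[0.9, 0.1, 0.5], [0.8, 0.3, 0.2]])
```

In this sketch, early fusion produces a single item-feature matrix that a backbone could consume, while late fusion leaves each modality-specific model untouched and merges only their output rankings. Standardizing before concatenation keeps a wide or large-scale modality from dominating the principal components.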
