ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
Fatemeh Nazary, Ali Tourani, Yashar Deldjoo, Tommaso Di Noia

TL;DR
ViLLA-MMBench is a reproducible benchmark suite for evaluating multimodal movie recommendation systems that use large language models for data augmentation and support interchangeable fusion techniques.
Contribution
It introduces a unified, extensible benchmark that combines multimodal data, LLM-based metadata enrichment, and flexible fusion methods for evaluating movie recommendation.
Findings
LLM-based augmentation enhances cold-start performance.
Strong text embeddings improve coverage and diversity.
Fusion methods significantly impact recommendation accuracy.
Abstract
Recommending long-form video content demands joint modeling of visual, audio, and textual modalities, yet most benchmarks address only raw features or narrow fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K, it aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada), generating high-quality synopses for thousands of movies. All text (raw or augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5), producing multiple ready-to-use sets. The pipeline supports interchangeable early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation.…
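The two ends of the fusion spectrum named in the abstract can be illustrated with a minimal sketch. It is not the benchmark's code: the function names are hypothetical, the per-modality L2 normalization before concatenation is an assumption, and Borda count is used here as one common rank-aggregation scheme, since the abstract does not say which scheme the pipeline uses.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def early_fusion_concat(*modality_vectors):
    """Early fusion (assumed variant): normalize each modality's item
    embedding, then concatenate them into a single feature vector."""
    fused = []
    for vec in modality_vectors:
        fused.extend(l2_normalize(vec))
    return fused


def borda_rank_aggregation(rankings):
    """Late fusion via Borda count (an illustrative choice).

    `rankings` is a list of ranked item-id lists, best item first, e.g. one
    list per modality-specific recommender. An item at position p in a list
    of length n earns n - p points; items missing from a list earn nothing
    from it. Ties are broken by item id for determinism.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - position)
    return sorted(scores, key=lambda item: (-scores[item], str(item)))


# Toy usage with made-up embeddings and rankings.
audio_vec = [3.0, 4.0]
text_vec = [0.0, 2.0]
fused_item = early_fusion_concat(audio_vec, text_vec)

audio_rank = ["a", "b", "c"]
visual_rank = ["b", "a", "c"]
text_rank = ["b", "c", "a"]
fused_rank = borda_rank_aggregation([audio_rank, visual_rank, text_rank])
```

In the early-fusion case a single backbone consumes the concatenated vector, while in the late-fusion case each modality is ranked separately and only the resulting lists are merged; mid-fusion options such as PCA or CCA would sit between the two by projecting modalities into a shared space first.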
