LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction

Aleksandar Pramov

arXiv:2510.22829 · cs.CV · October 28, 2025

TL;DR

This paper introduces a multimodal fusion system with a Gemma-3 LLM backbone for predicting commercial memorability. Guided by LLM-generated rationale prompts, the system is more robust and generalizes better on the final test set than a baseline ensemble of gradient boosted trees.

Contribution

It presents an LLM-based multimodal fusion approach guided by rationale prompts that are grounded in expert-derived aspects of memorability. On the final test set, the approach is more robust and generalizes better than the baseline.

Findings

01

The LLM-based system is more robust and generalizes better on the final test set than the baseline ensemble of gradient boosted trees

02

LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, guide the fusion model

03

Pre-computed visual (ViT) and textual (E5) features are integrated into the Gemma-3 backbone through multi-modal projections

Abstract

This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features through multi-modal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily tuned ensemble of gradient boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance on the final test set than the baseline. The paper's codebase can be found at…
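The abstract names two mechanisms: projections that map pre-computed features into the LLM's input space, and LoRA adaptation of the backbone. The toy sketch below illustrates both in plain Python under stated assumptions; it is not the paper's implementation. The names `fuse`, `LoRALinear`, `P_vis`, and `P_txt`, the tiny dimensions, the use of one soft token per modality, and the choice to prepend those tokens to the prompt embeddings are all hypothetical, and random vectors stand in for real ViT, E5, and prompt embeddings.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialised matrix as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r, alpha, rng):
        self.W = rand_matrix(d_out, d_in, rng)        # frozen base weight
        self.A = rand_matrix(r, d_in, rng)            # trainable, random init
        self.B = [[0.0] * r for _ in range(d_out)]    # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))     # low-rank path B(Ax)
        return [b + self.scale * d for b, d in zip(base, delta)]


def fuse(visual_feat, text_feat, prompt_embeds, P_vis, P_txt):
    """Project each modality to the hidden size and prepend both as soft tokens."""
    return [matvec(P_vis, visual_feat), matvec(P_txt, text_feat)] + prompt_embeds


rng = random.Random(0)
D_VIS, D_TXT, D_HID, N_PROMPT = 6, 4, 8, 3   # toy sizes, not the real model's

P_vis = rand_matrix(D_HID, D_VIS, rng)   # hypothetical visual projection
P_txt = rand_matrix(D_HID, D_TXT, rng)   # hypothetical textual projection
visual_feat = [rng.gauss(0, 1) for _ in range(D_VIS)]   # stands in for a ViT feature
text_feat = [rng.gauss(0, 1) for _ in range(D_TXT)]     # stands in for an E5 feature
prompt_embeds = [[rng.gauss(0, 1) for _ in range(D_HID)] for _ in range(N_PROMPT)]

# Two projected modality tokens followed by the prompt token embeddings.
sequence = fuse(visual_feat, text_feat, prompt_embeds, P_vis, P_txt)

# One LoRA-adapted linear map applied to every token in the fused sequence.
layer = LoRALinear(D_HID, D_HID, r=2, alpha=4, rng=rng)
adapted = [layer(tok) for tok in sequence]
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, so training begins from the pretrained behavior and only the small matrices `A` and `B` need to be updated.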
