LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction
Aleksandar Pramov

TL;DR
This paper introduces a multimodal fusion system using a large language model backbone to predict commercial memorability, leveraging LLM-generated rationales for improved robustness and generalization.
Contribution
It presents an LLM-based multimodal fusion approach guided by LLM-generated rationale prompts, which shows greater robustness and generalization on the final test set than a gradient-boosted-tree ensemble baseline.
Findings
LLM-based system outperforms baseline ensemble in robustness
Use of rationale prompts improves model interpretability
Fusion of visual and textual features enhances prediction accuracy
Abstract
This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features through multimodal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily tuned ensemble of gradient-boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and better generalization on the final test set than the baseline. The paper's codebase can be found at…
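The fusion idea in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: the feature sizes, the single linear layer standing in for the Gemma-3 backbone, the "soft token" concatenation, the mean pooling, and all function names are assumptions made only for illustration. It shows pre-computed ViT and E5 vectors being projected to a shared width, placed in front of embedded prompt tokens, and passed through a frozen weight with a LoRA-style low-rank update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only (not taken from the paper).
D_VIT, D_E5, D_LLM, RANK, ALPHA = 768, 1024, 256, 8, 16.0


def make_projection(d_in, d_out):
    """Random linear map standing in for a learned projection matrix."""
    return rng.normal(0.0, d_in ** -0.5, size=(d_in, d_out))


def lora_linear(x, w_frozen, lora_a, lora_b, alpha, rank):
    """Frozen weight plus low-rank update: x @ W + (alpha / rank) * (x @ A) @ B."""
    return x @ w_frozen + (alpha / rank) * (x @ lora_a) @ lora_b


def build_fused_sequence(vit_feat, e5_feat, prompt_embeds, proj_vis, proj_txt):
    """Project each modality to the LLM width and prepend both as soft tokens."""
    vis_tok = vit_feat @ proj_vis  # shape (1, D_LLM)
    txt_tok = e5_feat @ proj_txt   # shape (1, D_LLM)
    return np.concatenate([vis_tok, txt_tok, prompt_embeds], axis=0)


# Stand-ins for one ad's pre-computed features and an embedded rationale prompt.
vit_feat = rng.normal(size=(1, D_VIT))
e5_feat = rng.normal(size=(1, D_E5))
prompt_embeds = rng.normal(size=(5, D_LLM))

proj_vis = make_projection(D_VIT, D_LLM)
proj_txt = make_projection(D_E5, D_LLM)
seq = build_fused_sequence(vit_feat, e5_feat, prompt_embeds, proj_vis, proj_txt)

# One LoRA-adapted layer as a toy substitute for the LLM backbone.
w_backbone = make_projection(D_LLM, D_LLM)
lora_a = rng.normal(0.0, 0.01, size=(D_LLM, RANK))  # down-projection, random init
lora_b = np.zeros((RANK, D_LLM))                    # up-projection, zero init
hidden = lora_linear(seq, w_backbone, lora_a, lora_b, ALPHA, RANK)

# Hypothetical regression head: mean-pool the sequence, then a dot product.
head = rng.normal(size=(D_LLM,))
score = float(hidden.mean(axis=0) @ head)
```

Because the up-projection starts at zero, the adapted layer initially behaves exactly like the frozen one. Training would then update only the two small LoRA matrices and, in this sketch, the projections and the head.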
