Multimodal Fusion And Sparse Attention-based Alignment Model for Long Sequential Recommendation

Yongrui Fu; Jian Liu; Tao Li; Zonggang Wu; Shouke Qin; Hanmeng Liu

arXiv:2508.09664·cs.IR·August 14, 2025

TL;DR

MUFASA is a multimodal recommendation model that fuses item content through a title-anchored fusion layer and models long-term user interests with sparse attention, reporting gains over baselines on benchmark datasets and in online A/B tests.

Contribution

The paper introduces MUFASA, combining a multimodal fusion layer with a sparse attention-based alignment layer for long sequential recommendation, addressing content understanding and interest modeling challenges.

Findings

01. MUFASA outperforms state-of-the-art baselines on benchmark datasets.

02. Online A/B tests show significant improvements in production environments.

03. The model effectively captures multi-grained user interests across long sequences.

Abstract

Recent advances in multimodal recommendation enable richer item understanding, while modeling users' multi-scale interests across temporal horizons has attracted growing attention. However, effectively exploiting multimodal item sequences and mining multi-grained user interests to substantially bridge the gap between content comprehension and recommendation remain challenging. To address these issues, we propose MUFASA, a MUltimodal Fusion And Sparse Attention-based Alignment model for long sequential recommendation. Our model comprises two core components. First, the Multimodal Fusion Layer (MFL) leverages item titles as a cross-genre semantic anchor and is trained with a joint objective of four tailored losses that promote: (i) cross-genre semantic alignment, (ii) alignment to the collaborative space for recommendation, (iii) preserving the similarity structure defined by titles and…
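The abstract names cross-genre semantic alignment around item titles as objective (i) but does not give the form of the loss. One common way to express such an alignment objective is a symmetric InfoNCE contrastive loss between title embeddings and the embeddings of another modality. The sketch below is a minimal illustration under that assumption, not the authors' formulation; the function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, others, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to others[i] among all others[j]."""
    a = [l2_normalize(v) for v in anchors]
    b = [l2_normalize(v) for v in others]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities scaled by the temperature act as logits.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


def title_anchored_alignment_loss(title_emb, modality_emb, temperature=0.1):
    """Symmetric InfoNCE (assumed form): title -> modality and modality -> title."""
    return 0.5 * (
        info_nce(title_emb, modality_emb, temperature)
        + info_nce(modality_emb, title_emb, temperature)
    )


# Toy data (hypothetical): two items, 2-d embeddings.
titles = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # row i matches title i
shuffled = [[0.1, 0.9], [0.9, 0.1]]  # rows swapped, so pairs mismatch

loss_aligned = title_anchored_alignment_loss(titles, aligned)
loss_shuffled = title_anchored_alignment_loss(titles, shuffled)
```

With these toy inputs the loss for the matched pairs is lower than for the swapped pairs, which is the behavior an alignment objective of this kind is meant to reward. In a real model the embeddings would come from trained encoders and the loss would be one term in a larger joint objective.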
