Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning

Xiaofeng Pan; Jing Chen; Haitong Zhang; Menglin Xing; Jiayi Wei; Xuefeng Mu; Zhongqian Xie

arXiv:2505.23298 · cs.SD · May 30, 2025

TL;DR

This paper introduces a hierarchical, two-stage contrastive learning approach that bridges the semantic and user preference spaces in multi-modal music representation learning, improving performance on both music semantic tasks and music recommendation.

Contribution

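The paper proposes Hierarchical Two-stage Contrastive Learning (HTCL), a framework that models music similarity first from the semantic perspective and then from the user perspective, so that a single representation captures both semantic and user preference information.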

Findings

1. Learns comprehensive music representations
2. Improves performance on music semantic tasks
3. Improves music recommendation accuracy

Abstract

Recent work on music representation learning mainly focuses on learning acoustic music representations from unlabeled audio, or further attempts to acquire multi-modal music representations from scarce annotated audio-text pairs. These approaches either ignore language semantics or rely on labeled audio datasets that are difficult and expensive to create. Moreover, modeling only the semantic space usually yields unsatisfactory performance on music recommendation tasks, because the user preference space is ignored. In this paper, we propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that hierarchically models similarity, first from the semantic perspective and then from the user perspective, to learn a comprehensive music representation that bridges the gap between the semantic and user preference spaces. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder…
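The abstract is truncated before it specifies HTCL's training objective, so the following is only a minimal sketch, assuming a CLIP-style symmetric InfoNCE loss applied in two stages: audio embeddings against paired text embeddings (semantic space), then audio embeddings against embeddings of tracks that users co-prefer (user preference space). The function names, the temperature of 0.07, the random toy embeddings, and the co-preference pairing are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: row i of `a` and row i of `b` form a positive pair;
    every other row in the batch serves as an in-batch negative."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(a.shape[0])
    loss_a_to_b = -log_softmax(logits)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits.T)[idx, idx].mean()
    return float((loss_a_to_b + loss_b_to_a) / 2)


# Hypothetical toy batch standing in for encoder outputs (random data, not HTCL's encoders).
rng = np.random.default_rng(0)
batch, dim = 8, 16
audio = rng.normal(size=(batch, dim))                       # audio-encoder embeddings
text = audio + 0.1 * rng.normal(size=(batch, dim))          # paired text (e.g. BERT) embeddings
co_preferred = audio + 0.1 * rng.normal(size=(batch, dim))  # tracks users co-prefer (assumed pairing)

# Stage 1 (semantic space): align each audio clip with its paired text.
stage1_loss = symmetric_info_nce(audio, text)
# Stage 2 (user preference space): align each track with a co-preferred track.
stage2_loss = symmetric_info_nce(audio, co_preferred)
```

In this sketch both stages reuse one loss and differ only in how positive pairs are formed. How HTCL actually defines its positives, schedules or weights the two stages, or shares encoder parameters is not stated in the visible part of the abstract.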

Code & Models

Models: Tharya/HTCL (Hugging Face model)

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention · Dropout · Residual Connection