TL;DR
This paper introduces TASTE, a multimodal music recommendation dataset and framework, demonstrating the effectiveness of large-scale audio encoders and a new feature aggregation method for improved recommendation performance.
Contribution
It presents a new multimodal dataset, a benchmarking framework, and the MuQ-token method for efficient multi-layer audio feature integration in music recommendation.
Findings
Audio representations from large-scale audio encoders significantly improve recommendation performance.
MuQ-token outperforms other feature integration methods.
Content-based approaches are validated as effective for music recommendation.
Abstract
Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models, spanning both recall and ranking stages, predominantly rely on collaborative filtering, which fails to exploit the intrinsic characteristics of audio and consequently leads to suboptimal performance, particularly in cold-start scenarios. However, existing music recommendation datasets often lack rich multimodal information, such as raw audio signals and descriptive textual metadata. Moreover, current recommender system evaluation frameworks remain inadequate, as they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed to highlight the role of multimodal information in music recommendation. Our…
