TL;DR
CultureMERT-95M is a new multi-cultural music foundation model that uses continual pre-training to improve cross-cultural music understanding. It outperforms previous models on non-Western music tasks with minimal forgetting on Western-centric benchmarks.
Contribution
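The paper introduces a two-stage continual pre-training strategy that uses learning rate re-warming and re-decaying, together with a 650-hour multi-cultural data mix of Greek, Turkish, and Indian music, to enhance cross-cultural music representation learning.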
Findings
Average improvement of 4.9% in ROC-AUC and AP across non-Western music auto-tagging tasks
Stable adaptation with limited computational resources using learning rate re-warming and re-decaying
Multi-cultural model outperforms single-culture models overall
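The re-warming and re-decaying listed above can be pictured as a learning rate schedule. Below is a minimal Python sketch, assuming linear re-warming from a floor value up to a peak, followed by cosine re-decay back to the floor, with one such cycle per training stage. The function names, the cosine shape, and the per-stage configuration are hypothetical illustrations, not the paper's actual schedule or hyperparameters.

```python
import math


def rewarm_redecay_lr(step: int, total_steps: int, warmup_steps: int,
                      peak_lr: float, min_lr: float) -> float:
    """Learning rate at `step` for one re-warming + re-decaying cycle.

    Assumed shape (hypothetical): linear warm-up from min_lr to peak_lr
    over warmup_steps, then cosine decay from peak_lr to min_lr at total_steps.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if step < warmup_steps:
        # Re-warming: raise the LR linearly from the decayed floor value.
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    # Re-decaying: cosine anneal from peak_lr back down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def two_stage_schedule(stages):
    """Concatenate one re-warm/re-decay cycle per stage (hypothetical setup).

    `stages` is a list of dicts with keys total_steps, warmup_steps,
    peak_lr, and min_lr. Returns per-step learning rates, stage by stage.
    """
    lrs = []
    for s in stages:
        lrs.extend(
            rewarm_redecay_lr(t, s["total_steps"], s["warmup_steps"],
                              s["peak_lr"], s["min_lr"])
            for t in range(s["total_steps"])
        )
    return lrs
```

Restarting the warm-up at each stage raises the learning rate gradually instead of jumping straight to the peak from an already-converged checkpoint, which is a common motivation for re-warming in continual pre-training.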
Abstract
Recent advances in music foundation models have improved audio representation learning, yet their effectiveness across diverse musical traditions remains limited. We introduce CultureMERT-95M, a multi-culturally adapted foundation model developed to enhance cross-cultural music representation learning and understanding. To achieve this, we propose a two-stage continual pre-training strategy that integrates learning rate re-warming and re-decaying, enabling stable adaptation even with limited computational resources. Training on a 650-hour multi-cultural data mix, comprising Greek, Turkish, and Indian music traditions, results in an average improvement of 4.9% in ROC-AUC and AP across diverse non-Western music auto-tagging tasks, surpassing prior state-of-the-art, with minimal forgetting on Western-centric benchmarks. We further investigate task arithmetic, an alternative approach to…
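The abstract also mentions task arithmetic, though the text is cut off before it says how the paper applies it. Below is a minimal sketch of task arithmetic in its standard form, assuming it is used to combine several separately adapted checkpoints (for example, single-culture models) of the same base model. Parameters are modeled as plain Python lists keyed by name; the helper names, the single shared scale, and the toy values are hypothetical.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, pretrained: Params) -> Params:
    """tau = theta_finetuned - theta_pretrained, computed per parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge_with_task_arithmetic(pretrained: Params,
                               finetuned_models: List[Params],
                               scale: float) -> Params:
    """theta_merged = theta_pretrained + scale * sum_i tau_i.

    Assumes every model shares the pretrained model's parameter names and shapes.
    """
    vectors = [task_vector(ft, pretrained) for ft in finetuned_models]
    merged: Params = {}
    for name, base in pretrained.items():
        merged[name] = [
            b + scale * sum(v[name][j] for v in vectors)
            for j, b in enumerate(base)
        ]
    return merged


# Toy usage with made-up weights for two hypothetical single-culture models.
base = {"w": [1.0, 2.0]}
greek_adapted = {"w": [1.5, 2.0]}
turkish_adapted = {"w": [1.0, 3.0]}
merged = merge_with_task_arithmetic(base, [greek_adapted, turkish_adapted], scale=1.0)
```

The appeal of this approach is that each adapted model is trained once and the merge itself is a cheap weighted sum, so the scale can be tuned without further pre-training.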
