End-to-end learning for music audio tagging at scale

Jordi Pons; Oriol Nieto; Matthew Prockup; Erik Schmidt; Andreas Ehmann; Xavier Serra

arXiv:1711.02520 · cs.SD · June 18, 2018 · 82 cites

TL;DR

This paper compares waveform-based and spectrogram-based deep learning models for music auto-tagging, demonstrating that waveform models excel with large-scale datasets, while spectrogram models are more effective with limited data.

Contribution

It provides a comprehensive analysis of how different deep learning architectures perform on music tagging across various dataset sizes, highlighting the importance of domain assumptions.

Findings

1. Waveform models outperform spectrogram models on large datasets.
2. Spectrogram models are more effective with smaller datasets.
3. Large-scale data enables waveform-based models to excel.

Abstract

The lack of data tends to limit the outcomes of deep learning research, particularly when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study, 1.2M tracks annotated with musical labels are available to train our end-to-end models. This large amount of data allows us to unrestrictedly explore two different design paradigms for music auto-tagging: assumption-free models - using waveforms as input with very small convolutional filters; and models that rely on domain knowledge - log-mel spectrograms with a convolutional neural network designed to learn timbral and temporal features. Our work focuses on studying how these two types of deep architectures perform when datasets of variable size are available for training: the MagnaTagATune (25k songs), the Million Song Dataset (240k songs), and a private dataset of 1.2M songs. Our experiments suggest…
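The abstract contrasts waveform front ends built from very small convolutional filters with a log-mel spectrogram front end. As a rough illustration of why stacking small filters can still cover a useful span of audio, the sketch below computes the receptive field of a stack of 1D convolutions. The layer configuration (seven layers, filter length 3, stride 3), the 512-sample window with a 256-sample hop, and the 16 kHz sample rate are all assumptions chosen for illustration; they are not taken from the paper.

```python
def receptive_field(layers):
    """Return (receptive_field, total_stride), both in input samples, for a
    stack of 1D convolution layers given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1  # one sample seen, one sample between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each extra tap widens the view by `jump`
        jump *= stride                  # striding spreads later outputs further apart
    return rf, jump


# Hypothetical waveform front end: seven layers of length-3 filters, stride 3.
small_filter_stack = [(3, 3)] * 7

# Hypothetical spectrogram-style framing for comparison: one 512-sample
# window with a 256-sample hop, treated as a single strided layer.
frame_layer = [(512, 256)]

SAMPLE_RATE = 16000  # assumed sample rate in Hz

for name, layers in [("small filters", small_filter_stack), ("framing", frame_layer)]:
    rf, hop = receptive_field(layers)
    print(f"{name}: receptive field {rf} samples "
          f"({1000 * rf / SAMPLE_RATE:.1f} ms), hop {hop} samples")
```

Under these assumed settings, each output of the small-filter stack sees 3^7 = 2187 input samples, so a deep stack of tiny filters can span more audio per output than a single 512-sample analysis window.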

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing