# Multi-Level and Multi-Scale Feature Aggregation Using Pre-trained Convolutional Neural Networks for Music Auto-tagging

**Authors:** Jongpil Lee, Juhan Nam

arXiv: 1703.01793 · 2017-08-02

## TL;DR

This paper introduces a multi-level, multi-scale feature aggregation method using pre-trained CNNs for music auto-tagging, which outperforms previous state-of-the-art methods on the MagnaTagATune dataset and the Million Song Dataset.

## Contribution

The paper proposes a CNN-based architecture, trained in three steps, that combines features taken from multiple layers (levels) of CNNs pre-trained with different input sizes (scales), improving music auto-tagging performance and proving useful in transfer learning.

## Key findings

- Outperforms the previous state of the art on the MagnaTagATune dataset
- Outperforms the previous state of the art on the Million Song Dataset
- Effective in transfer learning scenarios

## Abstract

Music auto-tagging is often handled in a similar manner to image classification by regarding the 2D audio spectrogram as image data. However, music auto-tagging is distinguished from image classification in that the tags are highly diverse and have different levels of abstraction. Considering this issue, we propose a convolutional neural network (CNN)-based architecture that embraces multi-level and multi-scale features. The architecture is trained in three steps. First, we conduct supervised feature learning to capture local audio features using a set of CNNs with different input sizes. Second, we extract audio features from each layer of the pre-trained convolutional networks separately and aggregate them all together given a long audio clip. Finally, we put them into fully-connected networks and make final predictions of the tags. Our experiments show that using the combination of multi-level and multi-scale features is highly effective in music auto-tagging and that the proposed method outperforms the previous state of the art on the MagnaTagATune dataset and the Million Song Dataset. We further show that the proposed architecture is useful in transfer learning.
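
The three steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the random weights stand in for CNNs that would really be pre-trained on tag labels, the segment lengths (20 and 40 frames), channel counts, max-pooling within a segment, averaging across segments, and the single sigmoid output layer are all placeholders, and the helper names (`make_cnn`, `layer_features`, `clip_features`, `tag_probabilities`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, w, b):
    """Valid 1D convolution over time, then ReLU.
    x: (time, in_ch), w: (kernel, in_ch, out_ch), b: (out_ch,)"""
    k = w.shape[0]
    # windows[t, c, j] == x[t + j, c]; shape (time - k + 1, in_ch, k)
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=0)
    return np.maximum(np.einsum("tck,kco->to", windows, w) + b, 0.0)

def make_cnn(n_bins, channels, kernel=3):
    """Random weights standing in for a pre-trained CNN (assumption)."""
    layers, in_ch = [], n_bins
    for out_ch in channels:
        w = rng.normal(0.0, 0.1, size=(kernel, in_ch, out_ch))
        layers.append((w, np.zeros(out_ch)))
        in_ch = out_ch
    return layers

def layer_features(segment, cnn):
    """Multi-level: keep one time-pooled vector per layer of the CNN."""
    feats, h = [], segment
    for w, b in cnn:
        h = conv1d_relu(h, w, b)
        feats.append(h.max(axis=0))  # max-pool over time (assumed pooling)
    return np.concatenate(feats)

def clip_features(spec, cnns):
    """Multi-scale: one CNN per segment length, aggregated over a long clip."""
    per_scale = []
    for seg_len, cnn in cnns.items():
        n_seg = spec.shape[0] // seg_len
        segs = [spec[i * seg_len:(i + 1) * seg_len] for i in range(n_seg)]
        seg_feats = np.stack([layer_features(s, cnn) for s in segs])
        per_scale.append(seg_feats.mean(axis=0))  # average over segments (assumed)
    return np.concatenate(per_scale)

def tag_probabilities(features, w, b):
    """One dense layer with sigmoid outputs, since tags are multi-label."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

n_bins, n_tags = 16, 5
# Step 1 (stand-in): one CNN per input size.
cnns = {seg_len: make_cnn(n_bins, channels=[8, 8, 8]) for seg_len in (20, 40)}
# Step 2: extract per-layer features and aggregate them over the clip.
spec = rng.normal(size=(200, n_bins))  # placeholder spectrogram, (frames, bins)
features = clip_features(spec, cnns)
# Step 3: predict tags from the aggregated features.
w_out = rng.normal(0.0, 0.1, size=(features.size, n_tags))
probs = tag_probabilities(features, w_out, np.zeros(n_tags))
print(features.shape, probs.shape)
```

In this sketch the clip-level vector has one block per (scale, layer) pair, so adding a CNN with another input size or reading out another layer simply lengthens the vector that the final classifier sees.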

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.01793/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1703.01793/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/1703.01793/full.md

---
Source: https://tomesphere.com/paper/1703.01793