How Low Can You Go? Reducing Frequency and Time Resolution in Current CNN Architectures for Music Auto-tagging
Andres Ferraro, Dmitry Bogdanov, Xavier Serra, Jay Ho Jeon, Jason Yoon

TL;DR
This paper investigates how reducing frequency bands and frame rates in CNN-based music auto-tagging models affects performance, aiming to optimize accuracy, storage, and computational efficiency.
Contribution
It systematically evaluates the impact of lowering input resolution in mel-spectrograms on model accuracy and efficiency in music auto-tagging tasks.
Findings
Reduced frequency bands and frame rates can maintain competitive accuracy.
Lower resolution inputs decrease data storage and training time.
Optimal trade-offs depend on specific application requirements.
Abstract
Automatic tagging of music is an important research topic in Music Information Retrieval and audio analysis algorithms proposed for this task have achieved improvements with advances in deep learning. In particular, many state-of-the-art systems use Convolutional Neural Networks and operate on mel-spectrogram representations of the audio. In this paper, we compare commonly used mel-spectrogram representations and evaluate model performances that can be achieved by reducing the input size in terms of both lesser amount of frequency bands and larger frame rates. We use the MagnaTagaTune dataset for comprehensive performance comparisons and then compare selected configurations on the larger Million Song Dataset. The results of this study can serve researchers and practitioners in their trade-off decision between accuracy of the models, data storage size and training and inference times.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
