TL;DR
This paper introduces learned modulation filter bank front-ends for music audio tagging. The front-ends mimic perceptually motivated filter banks, with the aim of producing representations that are both interpretable and effective for tagging.
Contribution
It proposes end-to-end learned modulation filter bank front-ends, ModNet and SincModNet, for improved music tagging without relying on extensive domain knowledge.
Findings
Modulation filtering yields promising tagging performance.
The approach offers visualizable and interpretable audio representations.
Performance is evaluated on the MagnaTagATune dataset.
Abstract
Convolutional Neural Networks have been extensively explored in the task of automatic music tagging. The problem can be approached by using either engineered time-frequency features or raw audio as input. Modulation filter bank representations, which have been actively researched as a basis for timbre perception, have the potential to facilitate the extraction of perceptually salient features. We explore end-to-end learned front-ends for audio representation learning, ModNet and SincModNet, that incorporate a temporal modulation processing block. The structure is effectively analogous to a modulation filter bank, where the FIR filter center frequencies are learned in a data-driven manner. The expectation is that a perceptually motivated filter bank can provide a useful representation for identifying music features. Our experimental results provide a fully visualisable and interpretable…
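To make the idea of a modulation filter bank concrete, the following is a minimal sketch in pure Python. It is not the ModNet or SincModNet implementation. It assumes each modulation filter is a windowed-sinc band-pass FIR kernel defined by a center frequency and a bandwidth, and it applies a small bank of such filters to a synthetic amplitude envelope. In the paper the center frequencies are learned from data; here they are fixed, hypothetical values, and all function names, rates, and tap counts are illustrative assumptions.

```python
import math

def sinc_bandpass(center_hz, bandwidth_hz, rate_hz, num_taps=201):
    """Windowed-sinc band-pass FIR kernel (difference of two low-pass sincs)."""
    # Band edges as normalized frequencies (cycles per sample).
    low = max(center_hz - bandwidth_hz / 2.0, 0.0) / rate_hz
    high = (center_hz + bandwidth_hz / 2.0) / rate_hz
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * (high - low)
        else:
            ideal = (math.sin(2 * math.pi * high * t)
                     - math.sin(2 * math.pi * low * t)) / (math.pi * t)
        # Hamming window to reduce side lobes of the truncated sinc.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def fir_filter(signal, taps):
    """Convolve signal with taps, keeping only fully overlapping outputs."""
    k = len(taps)
    return [sum(taps[j] * signal[i + k - 1 - j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def modulation_energies(envelope, centers_hz, bandwidth_hz, rate_hz):
    """Mean output power of each modulation band for one envelope."""
    energies = []
    for center in centers_hz:
        out = fir_filter(envelope, sinc_bandpass(center, bandwidth_hz, rate_hz))
        energies.append(sum(v * v for v in out) / len(out))
    return energies

# Hypothetical setup: an envelope sampled at 50 Hz that fluctuates at 4 Hz.
RATE_HZ = 50.0
envelope = [1.0 + 0.5 * math.sin(2 * math.pi * 4.0 * n / RATE_HZ)
            for n in range(1000)]

# Fixed center frequencies stand in for the values a model would learn.
CENTERS_HZ = [2.0, 4.0, 8.0, 16.0]
energies = modulation_energies(envelope, CENTERS_HZ, bandwidth_hz=2.0,
                               rate_hz=RATE_HZ)
best_center = CENTERS_HZ[energies.index(max(energies))]
print(best_center, [round(e, 4) for e in energies])
```

In this toy setup the band centered at 4 Hz should carry most of the energy, since that is the envelope's modulation rate, while the constant offset is rejected by every band-pass filter. A learned front-end would instead treat the center frequencies as trainable parameters and update them by gradient descent, which is why the resulting filters can be inspected directly.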
