MuSLCAT: Multi-Scale Multi-Level Convolutional Attention Transformer for Discriminative Music Modeling on Raw Waveforms

Kai Middlebrook; Shyam Sudhakaran; David Guy Brizan

arXiv:2104.02309·cs.SD·April 7, 2021

Open Access

TL;DR

This paper introduces MuSLCAT, a novel multi-scale, multi-level convolutional attention transformer architecture for raw waveform music modeling, demonstrating improved efficiency and competitive performance on music tagging and genre recognition tasks.

Contribution

The paper proposes MuSLCAT and MuSLCAN architectures that effectively model hierarchical and sequential music features directly from raw waveforms, with a lightweight design and novel attention mechanisms.

Findings

01. MuSLCAT and MuSLCAN outperform state-of-the-art models on benchmark datasets.

02. Both architectures achieve competitive results with fewer parameters.

03. The models effectively capture multi-scale and multi-level music features.

Abstract

In this work, we aim to improve the expressive capacity of waveform-based discriminative music networks by modeling both sequential (temporal) and hierarchical information in an efficient end-to-end architecture. We present MuSLCAT, or Multi-scale and Multi-level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex music tags directly from raw waveform recordings. We also introduce a lightweight variant of MuSLCAT called MuSLCAN, short for Multi-scale and Multi-level Convolutional Attention Network. Both MuSLCAT and MuSLCAN model features from multiple scales and levels by integrating a frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions by using two convolutional attention networks with attention-augmented convolution (AAC) blocks. The…
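To make the attention-augmented convolution (AAC) idea mentioned in the abstract more concrete, here is a minimal NumPy sketch of one such block on a 1-D signal. It is not the authors' implementation. It assumes the common AAC formulation in which a convolution branch and a self-attention branch run in parallel on the same input and their outputs are concatenated along the channel axis. The function names (`conv1d_same`, `self_attention_1d`, `aac_block`), the single attention head, the absence of positional encodings, and all shapes are illustrative assumptions.

```python
import numpy as np


def conv1d_same(x, w):
    """1-D cross-correlation with zero padding so the length is preserved.

    x: (c_in, t) input; w: (c_out, c_in, k) kernel with odd k.
    Returns an array of shape (c_out, t).
    """
    k = w.shape[2]
    pad = k // 2
    t = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Sliding windows over time, shape (c_in, t, k).
    windows = np.stack([xp[:, i:i + t] for i in range(k)], axis=-1)
    return np.einsum("oik,itk->ot", w, windows)


def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_1d(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over time steps.

    x: (c_in, t); wq, wk: (d_k, c_in); wv: (d_v, c_in).
    Returns an array of shape (d_v, t).
    """
    q, k, v = wq @ x, wk @ x, wv @ x
    scores = (q.T @ k) / np.sqrt(q.shape[0])  # (t, t)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return v @ weights.T


def aac_block(x, w_conv, wq, wk, wv):
    """Concatenate the convolution and attention branches channel-wise."""
    conv_out = conv1d_same(x, w_conv)
    attn_out = self_attention_1d(x, wq, wk, wv)
    return np.concatenate([conv_out, attn_out], axis=0)


# Hypothetical sizes: 4 input channels, 16 time steps,
# 6 convolution channels plus 2 attention channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w_conv = rng.standard_normal((6, 4, 3))
wq = rng.standard_normal((2, 4))
wk = rng.standard_normal((2, 4))
wv = rng.standard_normal((2, 4))
y = aac_block(x, w_conv, wq, wk, wv)
print(y.shape)  # (8, 16)
```

The convolution branch only sees a local window of `k` samples, while the attention branch lets every time step draw on every other one, which is one way to read the abstract's pairing of local feature extraction with long-range dependencies.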

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Linear Warmup With Linear Decay · Residual Connection · Layer Normalization · Label Smoothing · Adam · Multi-Head Attention