WildMix Dataset and Spectro-Temporal Transformer Model for Monoaural Audio Source Separation

Amir Zadeh; Tianjun Ma; Soujanya Poria; Louis-Philippe Morency

arXiv:1911.09783 · cs.LG · November 25, 2019 · 6 cites

TL;DR

This paper introduces WildMix, a diverse in-the-wild dataset for monoaural source separation, and proposes the Spectro-Temporal Transformer model that effectively captures long-range dependencies for improved separation performance.

Contribution

The paper presents a challenging new dataset, WildMix, and a novel Spectro-Temporal Transformer model with a specialized encoder for monoaural source separation.

Findings

01

STT outperforms previous baselines on WildMix

02

WildMix extends the diversity of in-the-wild audio recordings

03

Spectro-Temporal Encoder effectively captures temporal and spectral dependencies

Abstract

Monoaural audio source separation is a challenging research area in machine learning. In this area, a mixture containing multiple audio sources is given, and a model is expected to disentangle the mixture into isolated atomic sources. In this paper, we first introduce a challenging new dataset for monoaural source separation called WildMix. WildMix is designed with the goal of extending the boundaries of source separation beyond what previous datasets in this area would allow. It contains diverse in-the-wild recordings from 25 different sound classes, combined with each other using arbitrary composition policies. Source separation often requires modeling long-range dependencies in both temporal and spectral domains. To this end, we introduce a novel transformer-based model called Spectro-Temporal Transformer (STT). STT utilizes a specialized encoder, called Spectro-Temporal Encoder…
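To make the problem setup concrete, the following is a minimal sketch of how a single-channel mixture and its separation targets can be constructed. It is not the WildMix composition policy, which the abstract does not specify: the `tone` and `mix` helpers, the random gains, and the random offsets are all hypothetical, and synthetic sine tones stand in for real recordings.

```python
import math
import random

def tone(freq_hz, n_samples, sample_rate=8000):
    """Return a sine tone as a list of floats (stand-in for a real recording)."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]

def mix(sources, gains, offsets, length):
    """Sum gain-scaled, time-shifted sources into one mono mixture.

    Returns the mixture and the aligned per-source targets, so that
    mixture[t] equals the sum of target[t] over all targets.
    """
    targets = []
    for src, gain, offset in zip(sources, gains, offsets):
        target = [0.0] * length
        for i, sample in enumerate(src):
            if 0 <= offset + i < length:
                target[offset + i] = gain * sample
        targets.append(target)
    mixture = [sum(column) for column in zip(*targets)]
    return mixture, targets

# Hypothetical "composition policy": random gain and random start time per source.
rng = random.Random(0)
sources = [tone(440, 4000), tone(660, 4000)]
gains = [rng.uniform(0.5, 1.0) for _ in sources]
offsets = [rng.randrange(0, 4000) for _ in sources]
mixture, targets = mix(sources, gains, offsets, length=8000)
```

A separation model receives only `mixture` and is trained to recover each entry of `targets`; because the sources may start at different times and overlap only partially, the model has to track each source over long spans of the signal.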

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax