WildMix Dataset and Spectro-Temporal Transformer Model for Monoaural Audio Source Separation
Amir Zadeh, Tianjun Ma, Soujanya Poria, Louis-Philippe Morency

TL;DR
This paper introduces WildMix, a diverse in-the-wild dataset for monoaural source separation, and proposes the Spectro-Temporal Transformer model that effectively captures long-range dependencies for improved separation performance.
Contribution
The paper presents WildMix, a challenging new dataset for monoaural source separation, and the Spectro-Temporal Transformer, a novel model built around a specialized encoder.
Findings
STT outperforms previous baselines on WildMix
WildMix extends the diversity of in-the-wild audio recordings
Spectro-Temporal Encoder effectively captures temporal and spectral dependencies
Abstract
Monoaural audio source separation is a challenging research area in machine learning. In this area, a mixture containing multiple audio sources is given, and a model is expected to disentangle the mixture into isolated atomic sources. In this paper, we first introduce a challenging new dataset for monoaural source separation called WildMix. WildMix is designed with the goal of extending the boundaries of source separation beyond what previous datasets in this area would allow. It contains diverse in-the-wild recordings from 25 different sound classes, combined with each other using arbitrary composition policies. Source separation often requires modeling long-range dependencies in both temporal and spectral domains. To this end, we introduce a novel transformer-based model called Spectro-Temporal Transformer (STT). STT utilizes a specialized encoder, called Spectro-Temporal Encoder…
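To make the task setup in the abstract concrete, here is a minimal sketch of how a single-channel mixture and its separation targets might be built from isolated sources. Everything in it is an assumption for illustration: the `sine` and `mix` helpers, the per-source gain and offset "policy", and the sample rate are hypothetical and are not taken from the paper, whose actual composition policies are not described in this summary.

```python
import math


def sine(freq_hz, n_samples, sample_rate=8000):
    """Hypothetical stand-in for an isolated source: a pure tone."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


def mix(sources, gains, offsets, length):
    """Sum sources into one single-channel mixture.

    Each source is scaled by its gain and placed at its sample offset.
    Returns the mixture and the per-source targets a separation model
    would be asked to recover (assumed setup, not the WildMix recipe).
    """
    mixture = [0.0] * length
    targets = []
    for src, gain, offset in zip(sources, gains, offsets):
        placed = [0.0] * length
        for i, sample in enumerate(src):
            pos = offset + i
            if 0 <= pos < length:  # drop samples that fall outside the clip
                placed[pos] = gain * sample
        targets.append(placed)
        mixture = [m + p for m, p in zip(mixture, placed)]
    return mixture, targets


# Two toy sources: one spans the clip, one starts a quarter of the way in.
mixture, targets = mix(
    [sine(440, 100), sine(880, 50)],
    gains=[1.0, 0.5],
    offsets=[0, 25],
    length=100,
)
```

In this toy setup the mixture is exactly the sample-wise sum of the targets, so a separation model would take `mixture` as input and be scored on how closely its outputs match each entry of `targets`.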
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
