Multi-Channel Masking with Learnable Filterbank for Sound Source Separation
Wang Dai, Archontis Politis, Tuomas Virtanen

TL;DR
This paper introduces a learnable filterbank for multi-channel sound source separation. It estimates a mask for each microphone channel, which improves separation over single-channel masking.
Contribution
It proposes a novel multi-channel masking framework using a learnable 1D Conv filterbank, enhancing separation by applying channel-specific masks in a learned feature domain.
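The notation here is our own, not the paper's. Let $x_c$ be the waveform at microphone $c$, $\mathcal{E}$ the 1D Conv encoder, $\mathcal{D}$ the 1D transposed Conv decoder, $\mathbf{M}_c$ the estimated mask for channel $c$, and $\odot$ elementwise multiplication. With these symbols, the contrast with single-channel masking at a reference microphone can be sketched as:

```latex
\begin{aligned}
\text{single-channel:}\quad \hat{s} &= \mathcal{D}\big(\mathbf{M} \odot \mathcal{E}(x_{\mathrm{ref}})\big) \\
\text{multi-channel:}\quad \hat{s} &= \mathcal{D}\Big(\sum_{c=1}^{C} \mathbf{M}_c \odot \mathcal{E}(x_c)\Big)
\end{aligned}
```

The sum over $c$ mirrors the channel sum in filter-and-sum beamforming. The per-channel masks take the place of the beamformer weights.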
Findings
Outperforms single-channel masking with learnable filterbank
Can outperform multi-channel complex masking in the STFT domain for certain models
Demonstrates spatial selectivity in the learned filterbank domain
Abstract
This work proposes a multi-channel masking framework based on a learnable filterbank for multi-channel source separation. The learnable filterbank is a 1D Conv layer that transforms the raw waveform into a 2D representation. In contrast to the conventional single-channel masking method, we estimate a mask for each individual microphone channel. The estimated masks are then applied to the transformed waveform representations, as in the traditional filter-and-sum beamforming operation. Specifically, each mask multiplies the corresponding channel's 2D representation, and the masked outputs of all channels are then summed. Finally, a 1D transposed Conv layer converts the summed masked signal back into the waveform domain. The experimental results show our method outperforms single-channel masking with a learnable filterbank and can outperform multi-channel complex masking…
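The encode, mask-per-channel, sum and decode pipeline from the abstract can be illustrated with a minimal NumPy sketch. It rests on several assumptions. The encoder is a bias-free linear 1D conv with N filters of length L and hop S. The encoder and decoder bases are random rather than trained. Random masks stand in for the mask-estimation network, which the abstract does not describe. The function names `encode`, `mask_and_sum` and `decode` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def encode(x, basis, stride):
    """Learnable-filterbank encoder as a bias-free 1D conv (no padding).

    x: (C, T) multi-channel waveform; basis: (N, L) filters.
    Returns (C, K, N): one 2D frames-by-filters representation per channel.
    """
    kernel = basis.shape[1]
    n_frames = (x.shape[1] - kernel) // stride + 1
    # Indices of the overlapping length-L frames, shape (K, L)
    idx = np.arange(n_frames)[:, None] * stride + np.arange(kernel)[None, :]
    frames = x[:, idx]                      # (C, K, L)
    # Correlate every frame with every filter, as a Conv1d layer does
    return frames @ basis.T                 # (C, K, N)


def mask_and_sum(rep, masks):
    """Filter-and-sum style masking: one mask per channel, then sum channels."""
    assert rep.shape == masks.shape
    return (masks * rep).sum(axis=0)        # (K, N)


def decode(rep, basis, stride):
    """1D transposed conv: map each frame back to L samples and overlap-add."""
    n_frames = rep.shape[0]
    kernel = basis.shape[1]
    frames = rep @ basis                    # (K, L)
    out = np.zeros((n_frames - 1) * stride + kernel)
    for k in range(n_frames):
        out[k * stride : k * stride + kernel] += frames[k]
    return out


# Toy setup (all sizes assumed): 4 mics, 1 s at 16 kHz, 64 filters of length 32, hop 16
rng = np.random.default_rng(0)
C, T, N, L, S = 4, 16000, 64, 32, 16
x = rng.standard_normal((C, T))
enc_basis = rng.standard_normal((N, L)) / np.sqrt(L)
dec_basis = rng.standard_normal((N, L)) / np.sqrt(N)

rep = encode(x, enc_basis, S)
# Hypothetical stand-in for the mask-estimation network's output
masks = rng.uniform(0.0, 1.0, size=rep.shape)
y = decode(mask_and_sum(rep, masks), dec_basis, S)
print(rep.shape, y.shape)  # (4, 999, 64) (16000,)
```

To get the single-channel masking baseline at microphone 0, replace `mask_and_sum(rep, masks)` with `masks[0] * rep[0]`. In a trained model the filterbank bases and the masks are learned. Here they are random, so `y` only demonstrates the data flow and is not a meaningful separation.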
Taxonomy
Topics: Speech and Audio Processing · Acoustic Wave Phenomena Research · Music and Audio Processing
