Implicit Filter-and-sum Network for Multi-channel Speech Separation
Yi Luo, Nima Mesgarani

TL;DR
This paper introduces iFaSNet, an improved version of FaSNet for multi-channel speech separation. It applies the filter-and-sum operation implicitly in the latent space of the reference microphone and uses feature-level NCC, and it outperforms FaSNet in all tested conditions.
Contribution
The paper proposes an implicit filter-and-sum formulation and feature-level normalized cross-correlation (NCC) features to improve FaSNet's speech separation performance.
Findings
iFaSNet outperforms FaSNet across all tested conditions.
The implicit formulation better matches end-to-end separation objectives.
Feature-level NCC improves the model's feature representation.
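As background for the NCC finding above, here is a minimal sketch of sample-level normalized cross-correlation: the cosine similarity between a reference window and each same-length window of a context taken from another microphone. The function name `ncc` and the toy signals are illustrative assumptions, not the paper's code. A feature-level variant would presumably apply the same similarity to learned feature vectors rather than raw samples; the paper's exact formulation is not reproduced here.

```python
import math


def ncc(reference, context):
    """Cosine similarity between `reference` and every same-length window of `context`.

    Hypothetical helper for illustration; returns one score in [-1, 1] per lag.
    """
    win = len(reference)
    ref_norm = math.sqrt(sum(x * x for x in reference))
    scores = []
    for start in range(len(context) - win + 1):
        window = context[start:start + win]
        win_norm = math.sqrt(sum(x * x for x in window))
        dot = sum(a * b for a, b in zip(reference, window))
        denom = ref_norm * win_norm
        # Guard against all-zero windows, where the similarity is undefined.
        scores.append(dot / denom if denom > 0 else 0.0)
    return scores


# Toy data: the context contains the reference pattern shifted by one sample.
reference = [1.0, -1.0, 2.0]
context = [0.0, 1.0, -1.0, 2.0, 0.0]
scores = ncc(reference, context)
best_lag = max(range(len(scores)), key=scores.__getitem__)
```

The lag with the highest score indicates where the other channel best matches the reference window, which is the kind of inter-channel cue such a feature can expose to a separation model.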
Abstract
Various neural network architectures have been proposed in recent years for the task of multi-channel speech separation. Among them, the filter-and-sum network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and has been shown to be effective in both ad-hoc and fixed microphone array geometries. In this paper, we investigate multiple ways to improve the performance of FaSNet. From the problem formulation perspective, we change the explicit time-domain filter-and-sum operation which involves all the microphones into an implicit filter-and-sum operation in the latent space of only the reference microphone. The filter-and-sum operation is applied on a context around the frame to be separated. This allows the problem formulation to better match the objective of end-to-end separation. From the feature extraction perspective, we modify the calculation of sample-level normalized cross…
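To make the explicit time-domain filter-and-sum operation mentioned in the abstract concrete, the sketch below filters each microphone channel with its own FIR filter and sums the results. It is a simplified illustration under stated assumptions: the filters here are fixed and hand-picked, whereas FaSNet estimates them with a neural network, and the function names and toy signals are hypothetical. The implicit latent-space variant proposed in the paper is not shown.

```python
def convolve_same_length(signal, taps):
    """Causal FIR filtering: y[n] = sum_k taps[k] * signal[n - k], truncated to len(signal)."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out[n] = acc
    return out


def filter_and_sum(channels, filters):
    """Filter each microphone channel with its own FIR filter, then sum across channels."""
    if len(channels) != len(filters):
        raise ValueError("need one filter per channel")
    length = len(channels[0])
    output = [0.0] * length
    for signal, taps in zip(channels, filters):
        filtered = convolve_same_length(signal, taps)
        for n in range(length):
            output[n] += filtered[n]
    return output


# Toy example (assumed data): the second microphone hears the source one sample later.
mic_ref = [1.0, 2.0, 3.0, 0.0]
mic_other = [0.0, 1.0, 2.0, 3.0]
# Delay the reference by one sample and weight both channels by 0.5,
# so the two aligned copies average back to the source.
filters = [[0.0, 0.5], [0.5]]
beamformed = filter_and_sum([mic_ref, mic_other], filters)
```

With these filters the two channels are time-aligned before summation, so the output is the source waveform delayed by one sample.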
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
