Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation
Ui-Hyeop Shin, Hyung-Min Park

TL;DR
This paper introduces SR-CorrNet, an asymmetric encoder-decoder model that combines a structured correlation-to-filter formulation with dynamic adaptation to the number of speakers, improving speech separation in noisy and reverberant acoustic environments.
Contribution
The paper proposes an asymmetric time-frequency (TF) encoder-decoder architecture with correlation-based filter estimation and dynamic adaptation to the number of speakers.
Findings
Consistent improvements on WSJ0-Mix, WHAMR!, and LibriCSS datasets.
Effective in both single- and multi-channel settings.
Enhances separation quality in noisy and reverberant conditions.
Abstract
Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate…
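The data flow described in the abstract (coarse separation in the encoder, then stage-wise refinement by a weight-shared decoder with cross-speaker interaction) can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: the function names, dimensions, random weights, and the use of softmax attention over the speaker axis are all assumptions made here, and "weight-shared" is read as applying the same decoder weights to every speaker stream. The TF dual-path blocks and the correlation-to-filter step are not modelled.

```python
import math
import random


def linear(x, w):
    """Apply a weight matrix w (rows = outputs) to a feature vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (hypothetical)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encoder(mixture_feat, w_enc):
    """Process the mixture once; a stand-in for the TF dual-path encoder."""
    return [math.tanh(v) for v in linear(mixture_feat, w_enc)]


def early_split(encoded, w_split):
    """Coarse separation: one projection per speaker, applied before decoding."""
    return [linear(encoded, w) for w in w_split]


def cross_speaker(streams):
    """Toy cross-speaker interaction: softmax attention over the speaker axis."""
    d = len(streams[0])
    out = []
    for q in streams:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in streams]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        mixed = [sum((e / z) * s[i] for e, s in zip(exps, streams)) for i in range(d)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])  # residual connection
    return out


def decoder_stage(streams, w_dec):
    """One refinement stage: the same w_dec is applied to every speaker stream."""
    refined = [[math.tanh(v) for v in linear(s, w_dec)] for s in streams]
    return cross_speaker(refined)


def separate(mixture_feat, num_speakers=2, num_stages=3, dim=8, seed=0):
    """Encoder -> early split -> repeated weight-shared decoder stages."""
    rng = random.Random(seed)
    w_enc = rand_matrix(dim, len(mixture_feat), rng)
    w_split = [rand_matrix(dim, dim, rng) for _ in range(num_speakers)]
    w_dec = rand_matrix(dim, dim, rng)
    streams = early_split(encoder(mixture_feat, w_enc), w_split)
    for _ in range(num_stages):
        streams = decoder_stage(streams, w_dec)
    return streams
```

Calling `separate([0.1, -0.2, 0.3, 0.4], num_speakers=3)` returns three feature vectors of length 8, one per speaker stream. The contrast with a late-split design is that the per-speaker split happens before the decoder stages rather than after them, so every refinement stage already operates on speaker-specific features.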
