Performance Based Cost Functions for End-to-End Speech Separation
Shrikant Venkataramani, Ryley Higa, Paris Smaragdis

TL;DR
This paper introduces new perceptually motivated loss functions for end-to-end speech separation, improving performance over traditional MSE by aligning better with human auditory perception.
Contribution
It proposes and evaluates novel loss functions for neural speech separation models, based on the BSS_Eval metrics (SDR, SIR, SAR) and the intelligibility metric STOI.
Findings
Proposed loss functions outperform MSE in subjective listening tests.
Combining different perceptual metrics yields better separation results.
Flexible loss function design adapts to specific separation tasks.
Abstract
Recent neural network strategies for source separation attempt to model audio signals by processing their waveforms directly. Mean squared error (MSE), which measures the Euclidean distance between the waveforms of the denoised speech and the ground-truth speech, has been a natural cost function for these approaches. However, MSE is not a perceptually motivated measure and may result in large perceptual discrepancies. In this paper, we propose and experiment with new loss functions for end-to-end source separation. These loss functions are motivated by BSS_Eval and perceptual metrics like source to distortion ratio (SDR), source to interference ratio (SIR), source to artifact ratio (SAR) and short-time objective intelligibility ratio (STOI). This enables the flexibility to mix and match these loss functions depending upon the requirements of the task. Subjective listening tests reveal that…
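To make the contrast between MSE and a ratio-based cost concrete, here is a minimal NumPy sketch. It is not the authors' formulation: `neg_sdr_loss` uses the simple energy-ratio definition of SDR (target energy over error energy, in dB) as an assumption, and `combined_loss` with its `weights` argument is a hypothetical illustration of the "mix and match" idea, not a function from the paper.

```python
import numpy as np

def mse_loss(estimate, target):
    # Mean squared error between the two waveforms.
    return float(np.mean((estimate - target) ** 2))

def neg_sdr_loss(estimate, target, eps=1e-8):
    # Negative SDR in dB, assuming the simple energy-ratio form:
    # SDR = 10 * log10(||target||^2 / ||target - estimate||^2).
    # Minimizing this value maximizes SDR.
    error = target - estimate
    ratio = (np.sum(target ** 2) + eps) / (np.sum(error ** 2) + eps)
    return float(-10.0 * np.log10(ratio))

def combined_loss(estimate, target, weights=(0.5, 0.5)):
    # Hypothetical weighted mix of the two costs above.
    w_mse, w_sdr = weights
    return w_mse * mse_loss(estimate, target) + w_sdr * neg_sdr_loss(estimate, target)

# Toy signals: a 440 Hz tone sampled at 16 kHz with two noise levels.
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
small_error = target + 0.01 * rng.standard_normal(16000)
large_error = target + 0.1 * rng.standard_normal(16000)

print("MSE     :", mse_loss(small_error, target), mse_loss(large_error, target))
print("neg SDR :", neg_sdr_loss(small_error, target), neg_sdr_loss(large_error, target))
```

In this toy setup both costs rank the cleaner estimate as better, but the SDR-based cost is expressed relative to the target energy, so it does not change when target and estimate are rescaled together, whereas MSE does. SIR, SAR, and STOI require decomposing the error or a time-frequency analysis and are left out of this sketch.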
