Harmonics to the Rescue: Why Voiced Speech is Not a WSS Process
Giovanni Bologni, Richard Heusdens, Richard C. Hendriks

TL;DR
This paper argues that voiced speech is better modeled as a cyclostationary (CS) process than as a wide-sense stationary (WSS) process, which improves cross-power spectral density estimation, source separation, and beamforming.
Contribution
The paper introduces cyclostationary modeling of voiced speech, challenges the traditional WSS assumption, and demonstrates the advantages of the cyclostationary model through simulations and real data.
Findings
The cyclostationary model improves the estimation of cross-power spectral densities.
It also improves source separation and beamforming performance.
Validation on real speech data supports the model's effectiveness.
Abstract
Speech processing algorithms often rely on statistical knowledge of the underlying process. Despite many years of research, however, the debate on the most appropriate statistical model for speech still continues. Speech is commonly modeled as a wide-sense stationary (WSS) process. However, the use of the WSS model for spectrally correlated processes is fundamentally wrong, as WSS implies spectral uncorrelation. In this paper, we demonstrate that voiced speech can be more accurately represented as a cyclostationary (CS) process. By employing the CS rather than the WSS model for processes that are inherently correlated across frequency, it is possible to improve the estimation of cross-power spectral densities (PSDs), source separation, and beamforming. We illustrate how the correlation between harmonic frequencies of CS processes can enhance system identification, and validate our…
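The abstract's central claim, that a WSS process is uncorrelated across frequency while a harmonic process is not, can be illustrated with a small numerical sketch. The code below is a toy example and not the paper's experiment: the signal model (a fixed three-harmonic waveform scaled by a random per-frame gain, plus weak white noise), the parameter values, and the helper name `spectral_coherence` are all assumptions made for illustration. Frames span an integer number of pitch periods, so each harmonic falls exactly on a DFT bin.

```python
import numpy as np


def spectral_coherence(frames, bin_a, bin_b):
    """Magnitude of the normalized correlation between two DFT bins,
    estimated by averaging over frames (one frame per row)."""
    spectra = np.fft.rfft(frames, axis=1)
    xa, xb = spectra[:, bin_a], spectra[:, bin_b]
    cross = np.mean(xa * np.conj(xb))
    power = np.sqrt(np.mean(np.abs(xa) ** 2) * np.mean(np.abs(xb) ** 2))
    return np.abs(cross) / power


rng = np.random.default_rng(0)

# Assumed toy parameters (not taken from the paper).
fs, f0 = 16000, 200           # sampling rate and fundamental frequency in Hz
period = fs // f0             # 80 samples per pitch period
frame_len = 4 * period        # 4 periods per frame, so harmonic k sits at bin 4k
n_frames = 2000
noise_level = 0.1

n = np.arange(frame_len)
# Fixed periodic waveform made of three harmonics of f0.
template = sum(np.cos(2 * np.pi * k * f0 * n / fs + 0.3 * k) / k for k in (1, 2, 3))

# Toy "voiced" signal: every frame is the same waveform with a random gain,
# so all harmonics fluctuate together from frame to frame.
gains = rng.standard_normal((n_frames, 1))
harmonic_frames = gains * template + noise_level * rng.standard_normal((n_frames, frame_len))

# WSS reference: white Gaussian noise.
white_frames = rng.standard_normal((n_frames, frame_len))

bin_h1 = frame_len // period  # DFT bin of the first harmonic
bin_h2 = 2 * bin_h1           # DFT bin of the second harmonic

coh_harmonic = spectral_coherence(harmonic_frames, bin_h1, bin_h2)
coh_white = spectral_coherence(white_frames, bin_h1, bin_h2)

print(f"coherence between harmonics 1 and 2, harmonic signal: {coh_harmonic:.3f}")
print(f"coherence between the same bins, white noise:        {coh_white:.3f}")
```

In this sketch the coherence between the two harmonic bins comes out close to one for the harmonic signal and close to zero for white noise. A WSS model would treat both cases as having zero cross-frequency correlation and would therefore discard the information shared between harmonics.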
