TL;DR
This paper introduces a spectrogram computation method based on Frequency Domain Linear Prediction (FDLP). It matches mel spectrograms on clean read speech and improves end-to-end recognition accuracy under train-test domain mismatch and reverberation.
Contribution
It presents an FDLP-based spectrogram that fits all-pole models to sub-band Hilbert envelopes and outperforms the standard mel spectrogram in end-to-end speech recognition under domain mismatch and reverberation.
Findings
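FDLP spectrogram performs on par with the mel spectrogram when training and test data are both clean read speech.
FDLP spectrogram achieves up to 25% relative WER improvement over the mel spectrogram under train-test domain mismatch, and up to 22% under reverberation.
A long 1.5-second context window lets the FDLP spectrogram capture the low-frequency temporal modulations of speech.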
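The percentages above are relative, not absolute, WER changes. As a minimal sketch of that arithmetic (the function name and the WER values are hypothetical illustrations, not figures from the paper):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, not results from the paper:
# a drop from 20.0% WER to 15.0% WER is a 25% relative reduction.
print(relative_wer_reduction(20.0, 15.0))  # → 25.0
```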
Abstract
We propose a technique to compute spectrograms using Frequency Domain Linear Prediction (FDLP) that uses all-pole models to fit the squared Hilbert envelope of speech in different frequency sub-bands. The spectrogram of a complete speech utterance is computed by overlap-add of contiguous all-pole model responses. A long context window of 1.5 seconds allows us to capture the low frequency temporal modulations of speech in the spectrogram. For an end-to-end automatic speech recognition task, the FDLP spectrogram performs on par with the standard mel spectrogram features for clean read speech training and test data. For more realistic speech data with train-test domain mismatches or reverberations, FDLP spectrogram shows up to 25% and 22% relative WER improvements over mel spectrogram respectively.
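The core FDLP step described in the abstract can be sketched as follows, assuming the textbook formulation: take the DCT of a signal window, select the DCT coefficients belonging to one sub-band, and fit an all-pole model to them by autocorrelation linear prediction. The model's power response, read along the time axis, then approximates the squared Hilbert envelope of that sub-band. The function name, the Hann sub-band window, the model order, and the small regularization term are assumptions made for illustration, not the authors' settings, and the sketch omits the overlap-add across contiguous windows and the multi-band stacking.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz


def fdlp_envelope(x, band, order=20):
    """Approximate the squared Hilbert envelope of x in one sub-band.

    x     : 1-D signal window (e.g. 1.5 s of speech samples)
    band  : (lo, hi) range of DCT coefficient indices for the sub-band
    order : all-pole model order (illustrative default)
    """
    n = len(x)
    c = dct(x, type=2, norm="ortho")          # frequency-domain representation
    lo, hi = band
    seg = c[lo:hi] * np.hanning(hi - lo)      # sub-band window (assumed Hann)

    # Autocorrelation of the DCT segment at lags 0..order.
    full = np.correlate(seg, seg, mode="full")
    r = full[len(seg) - 1 : len(seg) + order].copy()
    r[0] *= 1.0 + 1e-6                        # light regularization (assumption)

    # Yule-Walker equations: A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    a = np.concatenate(([1.0], solve_toeplitz(r[:order], -r[1 : order + 1])))
    gain = r[0] + np.dot(a[1:], r[1 : order + 1])  # prediction error power

    # Evaluate the all-pole power response on n points over [0, pi);
    # angle pi * t / n corresponds to time index t in the window.
    w = np.pi * np.arange(n) / n
    A = np.exp(-1j * np.outer(w, np.arange(order + 1))) @ a
    return gain / np.abs(A) ** 2
```

Because linear prediction is applied to frequency-domain coefficients, the all-pole response is a smooth function of time rather than of frequency, which is why a long window can represent slow temporal modulations with a small number of poles.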
