3-D Feature and Acoustic Modeling for Far-Field Speech Recognition
Anurenjan Purushothaman, Anirudh Sreeram, Sriram Ganapathy

TL;DR
This paper introduces a 3-D feature extraction and acoustic modeling approach for far-field speech recognition that combines multivariate autoregressive (MAR) modeling with 3-D CNNs, and reports better results than traditional beamforming methods.
Contribution
It proposes a direct multi-channel feature extraction method with 3-D CNN acoustic modeling, eliminating the need for beamforming enhancement in reverberant conditions.
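As a rough illustration of what convolving jointly over the microphone-channel, time, and frequency axes means, the sketch below applies a single 3-D kernel to a feature tensor. It is a naive NumPy loop written for clarity, not the paper's architecture; the function name `conv3d_valid`, the tensor layout, and all sizes are assumptions made for this example.

```python
import numpy as np

def conv3d_valid(features, kernel):
    """Naive 'valid' 3-D cross-correlation, the operation a CNN layer computes.

    features: array of shape (mics, time, freq)  (assumed layout)
    kernel:   array of shape (km, kt, kf)
    Returns an array of shape (mics-km+1, time-kt+1, freq-kf+1).
    """
    M, T, F = features.shape
    km, kt, kf = kernel.shape
    out = np.empty((M - km + 1, T - kt + 1, F - kf + 1))
    for m in range(out.shape[0]):
        for t in range(out.shape[1]):
            for f in range(out.shape[2]):
                # Each output value mixes neighbouring microphones,
                # frames, and frequency bands in one weighted sum.
                patch = features[m:m + km, t:t + kt, f:f + kf]
                out[m, t, f] = np.sum(patch * kernel)
    return out

# Hypothetical sizes: 5 microphones, 10 frames, 8 frequency bands.
features = np.ones((5, 10, 8))
kernel = np.ones((2, 3, 3))
out = conv3d_valid(features, kernel)  # out.shape == (4, 8, 6)
```

In a real acoustic model the kernel weights would be learned together with the rest of the network, so the way channels are combined is driven by the recognition objective rather than fixed by a separate enhancement stage.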
Findings
Significant WER reduction on CHiME-3 dataset
Improved recognition accuracy on REVERB Challenge dataset
Outperforms traditional beamforming-based systems
Abstract
Automatic speech recognition in multi-channel reverberant conditions is a challenging task. The conventional way of suppressing the reverberation artifacts involves a beamforming-based enhancement of the multi-channel speech signal, which is used to extract spectrogram-based features for a neural network acoustic model. In this paper, we propose to extract features directly from the multi-channel speech signal using a multivariate autoregressive (MAR) modeling approach, where the correlations among all the three dimensions of time, frequency and channel are exploited. The MAR features are fed to a convolutional neural network (CNN) architecture which performs the joint acoustic modeling on the three dimensions. The 3-D CNN architecture allows the combination of multi-channel features that optimize the speech recognition cost compared to the traditional beamforming models that focus on…
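The abstract does not spell out the MAR formulation, so the following is only a generic sketch of the underlying idea: a vector (multivariate) autoregressive model predicts the current multi-channel sample from a few past samples of all channels, and its coefficient matrices capture cross-channel correlation. The function name `fit_var`, the least-squares fitting method, and the data sizes are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def fit_var(x, order):
    """Least-squares fit of a vector autoregressive (VAR) model.

    x: array of shape (T, C), T samples over C channels.
    Returns A of shape (order, C, C) such that
    x[t] is approximated by sum_k A[k] @ x[t - 1 - k].
    """
    T, C = x.shape
    # Row i of `past` holds the `order` samples preceding target
    # x[order + i], most recent first, concatenated across channels.
    past = np.hstack([x[order - 1 - k:T - 1 - k] for k in range(order)])
    future = x[order:]
    coeffs, *_ = np.linalg.lstsq(past, future, rcond=None)
    # coeffs is (order*C, C); block k acts on row vectors, so transpose
    # each block to get matrices that act on column vectors.
    return coeffs.reshape(order, C, C).transpose(0, 2, 1)

# Hypothetical data: 200 samples from 4 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 4))
A = fit_var(x, order=2)  # A.shape == (2, 4, 4)
```

The off-diagonal entries of each `A[k]` are what distinguish a multivariate model from fitting one autoregressive model per channel: they let one channel's past contribute to predicting another channel.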
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
