Comparison of SVD and factorized TDNN approaches for speech to text

Jeffrey Josanne Michael; Nagendra Kumar Goel; Navneeth K; Jonas Robertson; Shravan Mishra

arXiv:2110.07027·cs.SD·October 15, 2021

Open Access

TL;DR

This paper compares SVD-based compression with pre-specified bottleneck layers as ways to reduce the real-time factor (RTF) and word error rate (WER) of hybrid HMM-DNN speech-to-text systems, reporting a 61.57% relative reduction in RTF and an almost 1% relative decrease in WER.

Contribution

It applies SVD to the TDNN layers and to the affine transforms of every LSTM cell in a real-time speech recognition model, and compares this with bottleneck layers of similar shape specified before training.

Findings

01  61.57% relative reduction in RTF

02  Almost 1% relative decrease in WER

03  The baseline TDNN-LSTM architecture is particularly useful in lightly reverberated environments
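To make the "relative reduction" figures above concrete, here is a minimal sketch of the arithmetic, assuming the usual definition of RTF as processing time divided by audio duration. The timings are hypothetical values chosen for illustration and are not taken from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF: time spent decoding divided by the duration of the audio."""
    return processing_seconds / audio_seconds


def relative_reduction_percent(baseline, improved):
    """Relative reduction of a metric with respect to its baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical timings for an 80-second recording (not from the paper).
baseline_rtf = real_time_factor(40.0, 80.0)   # 0.5
improved_rtf = real_time_factor(20.0, 80.0)   # 0.25

reduction = relative_reduction_percent(baseline_rtf, improved_rtf)
print(f"relative RTF reduction: {reduction:.2f}%")  # prints "relative RTF reduction: 50.00%"
```

The same helper applies to WER: a drop from a baseline WER to a slightly lower one is reported relative to the baseline, not as an absolute difference in percentage points.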

Abstract

This work concentrates on reducing the real-time factor (RTF) and word error rate (WER) of a hybrid HMM-DNN. Our baseline system uses an architecture with TDNN and LSTM layers. We find this architecture particularly useful for lightly reverberated environments. However, these models tend to demand more computation than is desirable. In this work, we explore alternate architectures in which singular value decomposition (SVD) is applied to the TDNN layers, as well as to the affine transforms of every LSTM cell, to reduce the RTF. We compare this approach with specifying bottleneck layers, similar to those introduced by SVD, before training. Additionally, we reduced the search space of the decoding graph to make it better suited to real-time applications. We report a 61.57% relative reduction in RTF and an almost 1% relative decrease in WER for our architecture trained on Fisher data along with reverberated…
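The idea of applying SVD to an affine transform can be sketched as follows. This is a generic truncated-SVD factorization in NumPy, not the authors' implementation; the layer dimensions, the retained rank, and the helper name `svd_factorize` are all assumptions made for illustration.

```python
import numpy as np


def svd_factorize(weight, rank):
    """Split a (out_dim x in_dim) weight matrix into two thinner factors
    using a truncated SVD, so that weight is approximated by a @ b."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Keep the `rank` largest singular values and fold them into the first factor.
    a = u[:, :rank] * s[:rank]   # shape (out_dim, rank)
    b = vt[:rank, :]             # shape (rank, in_dim)
    return a, b


rng = np.random.default_rng(0)
out_dim, in_dim, rank = 512, 1024, 128   # hypothetical sizes, not from the paper
weight = rng.standard_normal((out_dim, in_dim))

a, b = svd_factorize(weight, rank)

# One matrix-vector product becomes two cheaper ones through a narrow layer.
x = rng.standard_normal(in_dim)
y_full = weight @ x
y_low_rank = a @ (b @ x)

params_full = weight.size            # out_dim * in_dim
params_low_rank = a.size + b.size    # rank * (out_dim + in_dim)
print(params_full, params_low_rank)
```

The two factors behave like a linear bottleneck of width `rank` inserted into the layer, which is why the abstract can compare SVD against bottleneck layers specified before training. A random matrix like the one here is not well approximated at low rank; the sketch only illustrates the shapes and the parameter saving.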

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Geophysical Methods and Applications

Methods: Test · Sigmoid Activation · Tanh Activation · Long Short-Term Memory