Comparison of SVD and factorized TDNN approaches for speech to text
Jeffrey Josanne Michael, Nagendra Kumar Goel, Navneeth K, Jonas Robertson, Shravan Mishra

TL;DR
This paper compares SVD-based and bottleneck approaches for reducing the real-time factor (RTF) and word error rate (WER) of hybrid HMM-DNN speech-to-text systems, reporting a 61.57% relative reduction in RTF and an almost 1% relative decrease in WER.
Contribution
It applies SVD to the TDNN layers and to the affine transforms of the LSTM cells for real-time speech recognition, and compares this with bottleneck layers specified before training.
Findings
61.57% relative reduction in RTF
Almost 1% relative decrease in WER
Baseline TDNN-LSTM architecture found particularly useful in lightly reverberated environments
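
Both headline metrics are relative changes measured against a baseline. The sketch below shows the arithmetic, assuming the usual definition of RTF as processing time divided by audio duration. The function names and the example numbers are illustrative and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio decoded.

    Values below 1.0 mean the system runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_change_percent(baseline: float, new: float) -> float:
    """Signed relative change of `new` against `baseline`, in percent.

    A negative result is a reduction (for RTF and WER, an improvement).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (new - baseline) / baseline


# Hypothetical measurements, chosen only to make the arithmetic easy to follow.
baseline_rtf = real_time_factor(processing_seconds=60.0, audio_seconds=60.0)  # 1.0
reduced_rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)   # 0.5
print(relative_change_percent(baseline_rtf, reduced_rtf))  # -50.0
```

Under this convention a signed change of -50% and a "50% relative reduction" describe the same result.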
Abstract
This work concentrates on reducing the real-time factor (RTF) and word error rate (WER) of a hybrid HMM-DNN. Our baseline system uses an architecture with TDNN and LSTM layers. We find this architecture particularly useful for lightly reverberated environments. However, these models tend to demand more computation than is desirable. In this work, we explore alternate architectures in which singular value decomposition (SVD) is applied to the TDNN layers, as well as to the affine transforms of every LSTM cell, to reduce the RTF. We compare this approach with specifying, before training, bottleneck layers similar to those introduced by SVD. Additionally, we reduced the search space of the decoding graph to make it better suited to real-time applications. We report a 61.57% relative reduction in RTF and an almost 1% relative decrease in WER for our architecture trained on Fisher data along with reverberated…
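
The SVD step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a plain truncated SVD of a single affine weight matrix, which is then replaced by two thinner factors. The function names, layer sizes, and rank are hypothetical, and the paper's actual toolkit, rank selection, and any retraining after factorization are not shown.

```python
import numpy as np


def svd_factorize(weight: np.ndarray, rank: int):
    """Split `weight` (out_dim x in_dim) into A (out_dim x rank) and B (rank x in_dim).

    A @ B is the best rank-`rank` approximation of `weight` in the
    least-squares sense, so y = W x can be computed as y = A (B x).
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale the kept left singular vectors
    b = vt[:rank, :]            # kept right singular vectors
    return a, b


def affine_param_count(out_dim: int, in_dim: int, rank=None) -> int:
    """Weight count of one affine layer, dense or factorized (biases ignored)."""
    if rank is None:
        return out_dim * in_dim
    return rank * (out_dim + in_dim)


# Hypothetical layer: 1024 x 1024 weights factorized at rank 128.
rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024))
a, b = svd_factorize(w, rank=128)

x = rng.standard_normal(1024)
y_approx = a @ (b @ x)  # two thin matrix-vector products instead of one dense one

print(affine_param_count(1024, 1024))            # 1048576
print(affine_param_count(1024, 1024, rank=128))  # 262144
```

A bottleneck layer specified before training has the same two-factor shape as `a` and `b`, but its weights are learned from scratch rather than derived from an already trained matrix, which is the comparison the abstract describes.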
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Geophysical Methods and Applications
Methods: Test · Sigmoid Activation · Tanh Activation · Long Short-Term Memory
