Attention based end to end Speech Recognition for Voice Search in Hindi and English
Raviraj Joshi, Venkateshan Kannan

TL;DR
This paper improves end-to-end speech recognition for Hindi and English voice search by refining the attention mechanisms and training strategies of Listen-Attend-Spell (LAS) models, reporting a 15.7% relative WER reduction over state-of-the-art LAS models and a 36.9% improvement over a phoneme-CTC system.
Contribution
It extends LAS models with multi-objective training, multi-pass training, and external rescoring using language models and phoneme-based losses, improving accuracy for Hindi and English voice search.
Findings
15.7% relative WER reduction over state-of-the-art LAS models
36.9% overall improvement over phoneme-CTC system
Overview of the components that can be tuned in a LAS-based system
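To make the "relative WER reduction" figure concrete, here is a minimal sketch of how word error rate and a relative reduction are commonly computed. The function names, the sample utterances, and the baseline WER of 20.0 are hypothetical and chosen only for illustration; they are not values from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + substitution_cost))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical utterance pair: one substituted word out of four.
sample_wer = word_error_rate("show me red shoes", "show me red shoe")

# Hypothetical baseline WER; a 15.7% relative reduction would bring it to 16.86.
baseline_wer = 20.0
improved_wer = baseline_wer * (1 - 0.157)
reduction = relative_wer_reduction(baseline_wer, improved_wer)
```

A relative reduction is measured against the baseline error rate, so the same 15.7% figure corresponds to different absolute WER gains depending on where the baseline starts.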
Abstract
We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (LAS), we build upon and expand the model design and attention mechanisms to incorporate innovative approaches including multi-objective training, multi-pass training, and external rescoring using language models and phoneme based losses. We report a relative WER improvement of 15.7% on top of state-of-the-art LAS models using these modifications. Overall, we report an improvement of 36.9% over the phoneme-CTC system. The paper also provides an overview of different components that can be tuned in a LAS-based system.
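The abstract mentions external rescoring using language models but does not spell out the scoring rule. One common form, shown here purely as an assumption, is n-best rescoring with a log-linear combination of the LAS score and an external language model score. The `Hypothesis` class, the `lm_weight` value, and all scores below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    text: str
    las_logprob: float  # log-probability assigned by the LAS decoder (hypothetical)
    lm_logprob: float   # log-probability assigned by an external LM (hypothetical)


def rescore(nbest: List[Hypothesis], lm_weight: float = 0.3) -> List[Hypothesis]:
    """Sort hypotheses by las_logprob + lm_weight * lm_logprob, best first."""
    return sorted(
        nbest,
        key=lambda h: h.las_logprob + lm_weight * h.lm_logprob,
        reverse=True,
    )


# Hypothetical 2-best list for a voice search query.
nbest = [
    Hypothesis("red shoe for men", las_logprob=-4.0, lm_logprob=-14.0),
    Hypothesis("red shoes for men", las_logprob=-4.2, lm_logprob=-9.0),
]

best_without_lm = rescore(nbest, lm_weight=0.0)[0]
best_with_lm = rescore(nbest, lm_weight=0.3)[0]
```

In this sketch the LAS decoder alone slightly prefers the first hypothesis, while the language model term shifts the ranking toward the second. The weight is a tunable parameter that would normally be chosen on held-out data.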
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
