Attention based end to end Speech Recognition for Voice Search in Hindi and English

Raviraj Joshi; Venkateshan Kannan

arXiv:2111.10208 · eess.AS · February 1, 2022

TL;DR

This paper improves end-to-end speech recognition for Hindi and English voice search by extending the attention mechanisms and training strategies of Listen-Attend-Spell (LAS) models, reporting a 15.7% relative WER improvement over state-of-the-art LAS models and a 36.9% overall improvement over a phoneme-CTC system.

Contribution

It extends the LAS model design and attention mechanisms with multi-objective training, multi-pass training, and external rescoring using language models and phoneme-based losses, improving accuracy for voice search in Hindi and English.
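To illustrate what external rescoring with a language model can look like, here is a minimal sketch in Python. It is not the authors' implementation: it assumes the common approach of re-ranking an n-best list by adding a weighted language-model log-probability to the recognizer's score. The names `rescore_nbest` and `toy_lm_score`, the weight `lm_weight`, and all scores and example hypotheses are hypothetical.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Re-rank (text, asr_log_prob) pairs with an external LM score.

    Combined score = asr_log_prob + lm_weight * lm_log_prob.
    This interpolation is an assumed, commonly used scheme; the
    paper's exact rescoring formula is not given in this summary.
    """
    rescored = [
        (text, asr_logp + lm_weight * lm_score(text))
        for text, asr_logp in nbest
    ]
    # Highest combined log-score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def toy_lm_score(text: str) -> float:
    """Hypothetical unigram LM: sum of made-up per-word log-probabilities."""
    word_logp = {"red": -1.0, "shoes": -1.2, "read": -3.0, "shoos": -6.0}
    return sum(word_logp.get(word, -8.0) for word in text.split())


# Hypothetical 2-best list: the recognizer slightly prefers the wrong text.
nbest = [("read shoos", -2.0), ("red shoes", -2.4)]
best_text, best_score = rescore_nbest(nbest, toy_lm_score)[0]
```

In this toy case the language model penalizes the implausible word sequence enough that the second hypothesis is ranked first after rescoring. In practice `lm_weight` would be tuned on held-out data.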

Findings

01. 15.7% relative WER reduction over state-of-the-art LAS models

02. 36.9% overall improvement over phoneme-CTC system

03. Effective tuning of LAS components for multilingual speech recognition
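For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) and a relative WER reduction are typically computed. It uses the standard definition (word-level edit distance divided by reference length); the function names and the example WER values are hypothetical and are not figures from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# One substituted word out of four reference words.
example_wer = word_error_rate("red shoes for men", "read shoes for men")

# Hypothetical absolute WERs chosen so the relative reduction is 15.7%.
example_reduction = relative_wer_reduction(0.20, 0.1686)
```

A relative reduction is measured against the baseline's own error rate, so a 15.7% relative reduction from a hypothetical 20% WER is an absolute drop of about 3.1 points, not 15.7 points.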

Abstract

We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (LAS), we build upon and expand the model design and attention mechanisms to incorporate innovative approaches including multi-objective training, multi-pass training, and external rescoring using language models and phoneme based losses. We report a relative WER improvement of 15.7% on top of state-of-the-art LAS models using these modifications. Overall, we report an improvement of 36.9% over the phoneme-CTC system. The paper also provides an overview of different components that can be tuned in a LAS-based system.

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing