Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Shaan Bijwadia; Shuo-yiin Chang; Bo Li; Tara Sainath; Chao Zhang; Yanzhang He

arXiv:2211.00786·cs.SD·February 16, 2023

Open Access

TL;DR

This paper introduces a unified end-to-end model for speech recognition and endpointing that reduces latency and improves accuracy by jointly training both tasks and leveraging shared representations.

Contribution

The authors propose a novel multitask E2E model with a switch connection that jointly trains ASR and endpointing, enhancing efficiency and performance over separate models.

Findings

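01. Reduces median endpoint latency by 30.8% (120 ms) compared to separate single-task models.

02. Reduces 90th percentile endpoint latency by 23.0% compared to separate single-task models.

03. Improves word error rate by 10.6% (relative).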

Abstract

Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction), and 90th percentile latency by 170…
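The "switch" connection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the layer shapes, the `tanh` encoder stand-in, the linear endpointer, and every name below (`low_level_encoder`, `endpointer`, `switch_forward`, `FEAT_DIM`, `LATENT_DIM`) are hypothetical, and the latent size is assumed to equal the feature size so that one endpointer can accept either input.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 8    # hypothetical acoustic feature size per frame
LATENT_DIM = 8  # assumed equal to FEAT_DIM so the EP input shape is fixed

# Hypothetical stand-in for the low-level layers of the ASR audio encoder.
W_enc = rng.normal(size=(FEAT_DIM, LATENT_DIM))
# Hypothetical stand-in for the endpointer: a per-frame linear classifier.
W_ep = rng.normal(size=(LATENT_DIM, 1))


def low_level_encoder(frames):
    """Map audio frames to low-level latent representations."""
    return np.tanh(frames @ W_enc)


def endpointer(ep_input):
    """Return one probability per frame from a sigmoid over linear logits."""
    logits = (ep_input @ W_ep)[:, 0]
    return 1.0 / (1.0 + np.exp(-logits))


def switch_forward(frames, use_latents):
    """Route either raw frames or encoder latents into the same endpointer."""
    ep_input = low_level_encoder(frames) if use_latents else frames
    return endpointer(ep_input)


frames = rng.normal(size=(5, FEAT_DIM))  # 5 synthetic frames

# Switch on "audio": the EP runs without touching the ASR encoder.
p_audio = switch_forward(frames, use_latents=False)
# Switch on "latents": the EP reuses the ASR encoder's computation.
p_latent = switch_forward(frames, use_latents=True)
```

In this sketch the same endpointer weights serve both switch positions, which mirrors the idea that a single model can do cheap frame filtering from raw audio and EOQ prediction from ASR latents. How the switch position is chosen during training is not stated in the text above, so it is left out here.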

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Test