End-to-End Speech Recognition with High-Frame-Rate Features Extraction

Cong-Thanh Do

arXiv:1907.01957 · eess.AS · July 15, 2019 · 1 citation

TL;DR

This paper explores high-frame-rate feature extraction at 200 and 400 frames/second in end-to-end speech recognition, reporting relative WER reductions of up to 21.3% on WSJ and up to 11.8% on CHiME-5 binaural data.

Contribution

The paper introduces high-frame-rate feature extraction for end-to-end ASR and evaluates it against standard 100 frames/second features, both on its own and combined with speed-perturbation data augmentation.

Findings

01 Up to 21.3% relative WER reduction on WSJ

02 Up to 11.8% relative WER reduction on CHiME-5 binaural data

03 High-frame-rate features improve ASR performance independently and with data augmentation
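The findings above are stated as relative WER reductions. As a minimal sketch of how such a figure is usually computed, the helper below uses the common definition (baseline WER minus new WER, divided by baseline WER). The function name and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in percent (e.g. 10.0 for 10%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only, not results from the paper.
reduction = relative_wer_reduction(baseline_wer=10.0, new_wer=8.0)
print(f"relative WER reduction: {reduction:.1f}%")
```

A relative reduction is measured against the baseline error rate, so a fixed absolute improvement yields a larger relative figure when the baseline WER is low.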

Abstract

State-of-the-art end-to-end automatic speech recognition (ASR) extracts acoustic features from the input speech signal every 10 ms, which corresponds to a frame rate of 100 frames/second. In this report, we investigate the use of high-frame-rate features extraction in end-to-end ASR. High frame rates of 200 and 400 frames/second are used in the features extraction and provide additional information for end-to-end ASR. The effectiveness of high-frame-rate features extraction is evaluated independently and in combination with speed-perturbation-based data augmentation. Experiments performed on two speech corpora, Wall Street Journal (WSJ) and CHiME-5, show that using high-frame-rate features extraction yields improved performance for end-to-end ASR, both independently and in combination with speed perturbation. On the WSJ corpus, the relative reduction of word error rate (WER) yielded by…
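The relation between frame rate and frame shift in the abstract can be made concrete with a short sketch: 100 frames/second corresponds to a 10 ms shift, so 200 and 400 frames/second correspond to 5 ms and 2.5 ms. The 16 kHz sample rate, the 25 ms analysis window, and the function names below are assumptions chosen for illustration; the listing above does not state the paper's actual settings.

```python
import numpy as np

SAMPLE_RATE_HZ = 16000   # assumed sample rate
WINDOW_SAMPLES = 400     # assumed 25 ms analysis window at 16 kHz


def frame_shift_samples(sample_rate_hz: int, frames_per_second: int) -> int:
    """Convert a frame rate into a frame shift (hop size) in samples."""
    return sample_rate_hz // frames_per_second


def frame_signal(signal: np.ndarray, frame_length: int, frame_shift: int) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames, dropping any incomplete tail."""
    num_frames = 1 + (len(signal) - frame_length) // frame_shift
    starts = frame_shift * np.arange(num_frames)[:, None]
    offsets = np.arange(frame_length)[None, :]
    return signal[starts + offsets]


# One second of placeholder audio (random noise stands in for speech).
audio = np.random.default_rng(0).standard_normal(SAMPLE_RATE_HZ)

for fps in (100, 200, 400):
    shift = frame_shift_samples(SAMPLE_RATE_HZ, fps)
    frames = frame_signal(audio, WINDOW_SAMPLES, shift)
    print(f"{fps} frames/s -> shift {shift} samples, frames shape {frames.shape}")
```

Under these assumptions, one second of audio yields 98, 196, and 391 frames at 100, 200, and 400 frames/second. The window length is unchanged, so a higher frame rate only increases the overlap between neighbouring frames and the length of the feature sequence passed to the recognizer.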

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Speed perturbation (data augmentation)