Back-Translation-Style Data Augmentation for End-to-End ASR

Tomoki Hayashi; Shinji Watanabe; Yu Zhang; Tomoki Toda; Takaaki Hori; Ramon Astudillo; Kazuya Takeda

arXiv:1807.10893 · cs.CL · July 31, 2018 · 5 cites

TL;DR

This paper introduces a novel data augmentation technique for end-to-end speech recognition that leverages unpaired text data through a neural text-to-encoder model, improving performance and reducing unknown words.

Contribution

The paper proposes a back-translation-style augmentation method that retrains the E2E-ASR decoder on encoder hidden states predicted from unpaired text, aiming to improve recognition accuracy and make training more efficient.

Findings

1. Improved ASR performance on LibriSpeech dataset
2. Reduced number of unknown words
3. Effective use of unpaired text data for augmentation

Abstract

In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model which predicts a sequence of hidden states extracted by a pre-trained E2E-ASR encoder from a sequence of characters. Using hidden states as the target instead of acoustic features enables faster attention learning and reduces computational cost, thanks to sub-sampling in the E2E-ASR encoder. Unlike acoustic features, the hidden states also avoid the need to model speaker dependencies. After training, the text-to-encoder model generates the hidden states from a large amount of unpaired text, then the E2E-ASR decoder is retrained using the…
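
The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `AugmentedExample`, `build_decoder_training_set`, `toy_encoder`, and `toy_text_to_encoder` are hypothetical, the two toy functions stand in for trained neural models, the sub-sampling factor of 4 is an assumed value, and mixing real and synthetic examples in one training set is an assumption, since the abstract is cut off before it describes the retraining data.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
HiddenSeq = List[Vector]  # one vector per (sub-sampled) encoder step


@dataclass
class AugmentedExample:
    hidden_states: HiddenSeq  # decoder input, real or synthetic
    text: str                 # decoder target
    synthetic: bool           # True if hidden states were predicted from text


def build_decoder_training_set(
    paired: Sequence[Tuple[Sequence[Vector], str]],
    unpaired_text: Sequence[str],
    asr_encoder: Callable[[Sequence[Vector]], HiddenSeq],
    text_to_encoder: Callable[[str], HiddenSeq],
) -> List[AugmentedExample]:
    """Collect decoder training examples from paired speech and unpaired text."""
    examples: List[AugmentedExample] = []
    # Paired data: hidden states come from the pre-trained ASR encoder.
    for features, text in paired:
        examples.append(AugmentedExample(asr_encoder(features), text, False))
    # Unpaired text: hidden states are predicted by the text-to-encoder model.
    for text in unpaired_text:
        examples.append(AugmentedExample(text_to_encoder(text), text, True))
    return examples


def toy_encoder(features: Sequence[Vector], subsample: int = 4) -> HiddenSeq:
    """Stand-in for the ASR encoder: keeps every `subsample`-th frame (assumed factor)."""
    return [list(frame) for frame in features[::subsample]]


def toy_text_to_encoder(text: str) -> HiddenSeq:
    """Stand-in for the text-to-encoder model: one 1-dim vector per character."""
    return [[float(ord(ch))] for ch in text]


if __name__ == "__main__":
    paired_data = [([[0.0]] * 8, "ab")]  # 8 acoustic frames with transcript "ab"
    dataset = build_decoder_training_set(
        paired_data, ["xyz"], toy_encoder, toy_text_to_encoder
    )
    for example in dataset:
        print(example.text, len(example.hidden_states), example.synthetic)
```

Because the decoder only ever consumes hidden-state sequences, the synthetic examples need no acoustic features at all, and the sub-sampled targets are shorter than the frame sequences an acoustic-feature target would require.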

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling