A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition

Shigeki Karita; Yotaro Kubo; Michiel Adriaan Unico Bacchiani; Llion Jones

arXiv:2106.05111·cs.CL·June 10, 2021

TL;DR

This paper compares LSTM and Conformer architectures, three loss functions (connectionist temporal classification, transducer, and attention-based), and recent training techniques for character-based Japanese speech recognition, and reports state-of-the-art results for its best configuration.

Contribution

It provides a comprehensive comparison of LSTM and Conformer models with different loss functions and training techniques for Japanese ASR, achieving new state-of-the-art performance.

Findings

01

Conformer models outperform LSTM in Japanese ASR.

02

Data augmentation and advanced training improve accuracy.

03

Conformer transducers are computationally efficient.

Abstract

End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR), especially for Japanese, since word-based tokenization of Japanese is not trivial and E2E modeling can model character sequences directly. This paper focuses on the latest E2E modeling techniques and investigates their performance on character-based Japanese ASR through comparative experiments. The results are analyzed and discussed to understand the relative advantages of long short-term memory (LSTM) and Conformer models in combination with connectionist temporal classification, transducer, and attention-based loss functions. Furthermore, the paper investigates the effectiveness of recent training techniques such as data augmentation (SpecAugment), variational noise injection, and exponential moving average. The best configuration found in the paper achieved the state-of-the-art…
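The three training techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain-list data layout, the mask value of 0.0, and all hyperparameters are assumptions chosen for illustration, and the masking function covers only the frequency and time masking parts of SpecAugment (time warping is omitted). It assumes the maximum mask widths do not exceed the spectrogram dimensions.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """SpecAugment-style masking (sketch).

    spec is a list of frames, each a list of frequency-bin values.
    Returns a copy with one random frequency band and one random
    time span set to 0.0. The input is left unmodified.
    """
    num_frames = len(spec)
    num_bins = len(spec[0])
    out = [list(frame) for frame in spec]

    # Frequency mask: zero bins [f0, f0 + f) in every frame.
    f = rng.randint(0, max_freq_width)
    f0 = rng.randint(0, num_bins - f)
    for frame in out:
        for k in range(f0, f0 + f):
            frame[k] = 0.0

    # Time mask: zero whole frames [t0, t0 + t).
    t = rng.randint(0, max_time_width)
    t0 = rng.randint(0, num_frames - t)
    for i in range(t0, t0 + t):
        out[i] = [0.0] * num_bins
    return out


def add_weight_noise(params, stddev, rng):
    """Variational-noise-style injection (sketch): add zero-mean
    Gaussian noise to each parameter for one training step."""
    return [p + rng.gauss(0.0, stddev) for p in params]


def ema_update(shadow, params, decay):
    """Exponential moving average of parameters:
    shadow <- decay * shadow + (1 - decay) * params, elementwise."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


if __name__ == "__main__":
    rng = random.Random(0)
    spec = [[1.0] * 8 for _ in range(10)]  # 10 frames, 8 frequency bins
    masked = mask_spectrogram(spec, max_freq_width=3, max_time_width=4, rng=rng)
    noisy = add_weight_noise([0.5, -0.2], stddev=0.01, rng=rng)
    shadow = ema_update([0.0, 0.0], noisy, decay=0.999)
    print(len(masked), len(masked[0]), len(shadow))
```

In a setup like this, masking is applied to input features and noise to weights only during training, while the averaged (shadow) parameters are typically the ones used for evaluation.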
