A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition
Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones

TL;DR
This paper compares various neural architectures and training methods for Japanese speech recognition, demonstrating state-of-the-art results with efficient models and advanced training techniques.
Contribution
It provides a comprehensive comparison of LSTM and Conformer models with different loss functions and training techniques for Japanese ASR, achieving new state-of-the-art performance.
Findings
Conformer models outperform LSTM in Japanese ASR.
Data augmentation (SpecAugment) and training techniques such as variational noise injection and exponential moving average improve accuracy.
Conformer transducers are computationally efficient.
Abstract
End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR), especially for Japanese, since word-based tokenization of Japanese is not trivial and E2E modeling can model character sequences directly. This paper focuses on the latest E2E modeling techniques and investigates their performance on character-based Japanese ASR through comparative experiments. The results are analyzed and discussed in order to understand the relative advantages of long short-term memory (LSTM) and Conformer models in combination with connectionist temporal classification, transducer, and attention-based loss functions. Furthermore, the paper investigates the effectiveness of recent training techniques such as data augmentation (SpecAugment), variational noise injection, and exponential moving average. The best configuration found in the paper achieved the state-of-the-art…
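Of the training techniques named in the abstract, exponential moving average is the simplest to illustrate. The sketch below is a minimal pure-Python illustration of the general idea, not the paper's implementation: a "shadow" copy of the parameters is updated after every training step as a weighted average of its previous value and the current parameters, and the shadow copy is typically the one used for evaluation. The function name `ema_update`, the dictionary layout, the decay value of 0.9, and the toy "training step" are all assumptions made for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(shadow: Params, params: Params, decay: float) -> Params:
    """Return the updated shadow parameters.

    For every value: shadow <- decay * shadow + (1 - decay) * param.
    `decay` close to 1.0 makes the average change slowly.
    """
    return {
        name: [decay * s + (1.0 - decay) * p
               for s, p in zip(shadow[name], params[name])]
        for name in params
    }


# Toy example: one parameter vector "w" (hypothetical, for illustration only).
params: Params = {"w": [1.0, 2.0]}
shadow: Params = {"w": list(params["w"])}  # shadow starts as a copy
decay = 0.9  # assumed value; real systems often use something closer to 0.999

for step in range(3):
    # Stand-in for an optimizer step: every weight moves by +1.0.
    params = {"w": [v + 1.0 for v in params["w"]]}
    shadow = ema_update(shadow, params, decay)

print("raw parameters:", params["w"])
print("EMA parameters:", shadow["w"])
```

The averaged values lag behind the raw parameters, which is the intended effect: step-to-step fluctuations in the weights are smoothed out before the model is evaluated.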
