Construction of a Large-scale Japanese ASR Corpus on TV Recordings
Shintaro Ando, Hiromasa Fujihara

TL;DR
This paper introduces a large-scale Japanese speech corpus derived from TV recordings, enabling improved training of ASR systems, and provides open access to the dataset and training scripts for research use.
Contribution
We developed a novel iterative workflow for extracting aligned speech and subtitles from TV recordings to create a large-scale Japanese ASR corpus.
Findings
A model trained on our corpus outperforms one trained on CSJ
The corpus improves ASR performance on Japanese TEDx videos
The dataset and scripts are publicly available
Abstract
This paper presents a new large-scale Japanese speech corpus for training automatic speech recognition (ASR) systems. The corpus contains over 2,000 hours of speech with transcripts, built from Japanese TV recordings and their subtitles. We develop an iterative workflow that extracts matching audio and subtitle segments from TV recordings, based on a conventional method for lightly supervised audio-to-text alignment. We evaluate a model trained on our corpus using an evaluation dataset built from Japanese TEDx presentation videos and confirm that it outperforms a model trained on the Corpus of Spontaneous Japanese (CSJ). The experimental results show the usefulness of our corpus for training ASR systems. The corpus is made public for the research community, along with Kaldi scripts for training the models reported in this paper.
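As a rough illustration of the kind of filtering step that lightly supervised alignment typically relies on, the sketch below keeps only those segments whose subtitle text closely matches an ASR hypothesis for the same audio. It is not the authors' pipeline: the function names, the character-level error rate as the matching criterion, and the 0.1 threshold are all assumptions made for this example. In an iterative workflow, the retained segments would plausibly be used to retrain the recognizer before decoding the recordings again.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def select_segments(pairs: List[Tuple[str, str]],
                    max_cer: float = 0.1) -> List[Tuple[str, str]]:
    """Keep (subtitle, hypothesis) pairs that agree closely.

    max_cer is a hypothetical threshold chosen for illustration only.
    """
    return [(sub, hyp) for sub, hyp in pairs
            if char_error_rate(sub, hyp) <= max_cer]


if __name__ == "__main__":
    candidates = [
        ("こんにちは", "こんにちは"),  # subtitle and hypothesis agree
        ("おはよう", "さようなら"),    # subtitle does not match the audio
    ]
    for subtitle, hypothesis in select_segments(candidates):
        print(subtitle, hypothesis)
```

Character-level comparison is used here because Japanese text has no whitespace word boundaries; a word-level criterion would first require a tokenizer.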
