Construction of a Large-scale Japanese ASR Corpus on TV Recordings

Shintaro Ando; Hiromasa Fujihara

arXiv:2103.14736·cs.SD·March 30, 2021

TL;DR

This paper introduces a large-scale Japanese speech corpus derived from TV recordings, enabling improved training of ASR systems, and provides open access to the dataset and training scripts for research use.

Contribution

We developed a novel iterative workflow for extracting aligned speech and subtitles from TV recordings to create a large-scale Japanese ASR corpus.

Findings

01

A model trained on our corpus outperforms one trained on CSJ

02

The corpus improves ASR performance on Japanese TEDx videos

03

The dataset and scripts are publicly available

Abstract

This paper presents a new large-scale Japanese speech corpus for training automatic speech recognition (ASR) systems. The corpus contains over 2,000 hours of speech with transcripts, built from Japanese TV recordings and their subtitles. We develop an iterative workflow that extracts matching audio and subtitle segments from TV recordings, based on a conventional method for lightly-supervised audio-to-text alignment. We evaluate a model trained on our corpus using an evaluation dataset built from Japanese TEDx presentation videos and confirm that it performs better than a model trained on the Corpus of Spontaneous Japanese (CSJ). The experimental results show the usefulness of our corpus for training ASR systems. The corpus is made public for the research community, along with Kaldi scripts for training the models reported in this paper.
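The abstract does not spell out how "matching" audio and subtitle segments are selected. As a rough illustration only, one common ingredient of lightly-supervised alignment is to decode each audio segment with an existing ASR model and keep the segment when the hypothesis is close enough to the subtitle text. The sketch below assumes a character error rate (CER) criterion; the function names, the `max_cer` threshold, and the `(segment_id, subtitle, hypothesis)` tuple layout are hypothetical and are not taken from the paper or from its Kaldi scripts.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref (the subtitle text)."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def select_matching(
    pairs: List[Tuple[str, str, str]], max_cer: float = 0.1
) -> List[Tuple[str, str]]:
    """Keep (segment_id, subtitle) pairs whose ASR hypothesis is close to the subtitle.

    max_cer is an assumed, illustrative threshold, not a value from the paper.
    """
    return [(seg_id, sub) for seg_id, sub, hyp in pairs if cer(sub, hyp) <= max_cer]
```

In an iterative setup of this kind, the retained segments would be used to retrain the model, and the retrained model would then decode the recordings again so that more segments pass the filter in the next round. Whether the paper's workflow follows exactly this loop is not stated in the abstract.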
