Pansori: ASR Corpus Generation from Open Online Video Contents

Yoona Choi; Bowon Lee

arXiv:1812.09798 · eess.AS · December 27, 2018 · 5 cites

Open Access · 3 Repos

TL;DR

Pansori is an open-source tool that semi-automatically builds ASR corpora in multiple languages from online videos. The authors demonstrate it with the Korean Pansori-TEDxKR dataset, released to make speech recognition research more accessible.

Contribution

The paper presents Pansori, open-source software for semi-automatically generating ASR corpora from online videos, and uses it to produce Pansori-TEDxKR, which the authors describe as the first high-quality Korean speech corpus freely available for independent research.

Findings

01 Created the Pansori-TEDxKR Korean speech dataset

02 Demonstrated the effectiveness of Pansori in corpus generation

03 Released the tool and dataset for community use

Abstract

This paper introduces Pansori, a program used to create ASR (automatic speech recognition) corpora from online video content. It utilizes a cloud-based speech API to easily create a corpus in different languages. Using this program, we semi-automatically generated the Pansori-TEDxKR dataset from Korean TED conference talks with community-transcribed subtitles. It is the first high-quality corpus for the Korean language freely available for independent research. Pansori is released as open-source software, and the generated corpus is released under a permissive public license for community use and participation.
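The abstract does not spell out how the subtitles and the cloud speech API interact, so the following is only a rough sketch of one plausible reading: audio is cut at subtitle boundaries, each segment is sent to a speech API, and a segment is kept only when the API output agrees closely enough with the subtitle text. The names `Cue`, `similarity`, and `filter_cues`, the `transcribe` callable, and the 0.8 threshold are all hypothetical and are not taken from the Pansori code base.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


@dataclass
class Cue:
    """One subtitle entry (hypothetical structure, not Pansori's own)."""
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str     # community-transcribed subtitle text


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after lowercasing and
    collapsing whitespace."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def filter_cues(
    cues: Iterable[Cue],
    transcribe: Callable[[Cue], str],
    threshold: float = 0.8,  # assumed cutoff, chosen only for illustration
) -> List[Cue]:
    """Keep cues whose subtitle text agrees with the ASR hypothesis."""
    kept = []
    for cue in cues:
        hypothesis = transcribe(cue)  # stands in for a cloud speech API call
        if similarity(cue.text, hypothesis) >= threshold:
            kept.append(cue)
    return kept


# Usage with a stubbed transcriber in place of a real speech API.
cues = [
    Cue(0.0, 2.5, "hello everyone"),
    Cue(2.5, 5.0, "thank you for coming"),
]
fake_api = {0.0: "hello everyone", 2.5: "something unrelated entirely"}


def fake_transcribe(cue: Cue) -> str:
    return fake_api[cue.start]


kept = filter_cues(cues, fake_transcribe)
```

Passing the transcriber in as a callable keeps the filter independent of any particular speech service, so a real API client could replace the stub without changing the filtering logic.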

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems