Pansori: ASR Corpus Generation from Open Online Video Contents
Yoona Choi, Bowon Lee

TL;DR
Pansori is an open-source tool that semi-automatically builds ASR corpora from online videos and can target different languages. The Korean Pansori-TEDxKR dataset was built with it and released to make speech recognition research more accessible.
Contribution
The paper presents Pansori, open-source software for semi-automatically generating ASR corpora from online videos, together with the Pansori-TEDxKR dataset, which the authors describe as the first high-quality Korean speech corpus freely available for independent research.
Findings
Created the Pansori-TEDxKR Korean speech dataset
Demonstrated the effectiveness of Pansori in corpus generation
Released the tool and dataset for community use
Abstract
This paper introduces Pansori, a program for creating ASR (automatic speech recognition) corpora from online video content. It uses a cloud-based speech API, which makes it easy to create corpora in different languages. Using this program, we semi-automatically generated the Pansori-TEDxKR dataset from Korean TED conference talks with community-transcribed subtitles. It is the first high-quality corpus for the Korean language freely available for independent research. Pansori is released as open-source software, and the generated corpus is released under a permissive public license for community use and participation.
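The abstract does not say how the cloud speech API is used. One plausible reading is that each subtitle cue is transcribed by the API and kept only when the result agrees closely with the community transcript. The sketch below illustrates that idea under stated assumptions and is not Pansori's actual implementation: `Cue`, `select_cues`, the `transcribe` callback (a stand-in for whatever speech API is called), and the 0.8 threshold are all hypothetical names and values.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


@dataclass
class Cue:
    """One subtitle cue: start/end time in seconds plus the community transcript."""
    start: float
    end: float
    text: str


def normalize(text: str) -> str:
    """Lowercase and keep only letters/digits so spacing and punctuation are ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def similarity(reference: str, hypothesis: str) -> float:
    """Character-level similarity ratio in [0, 1] between two transcripts."""
    return SequenceMatcher(None, normalize(reference), normalize(hypothesis)).ratio()


def select_cues(
    cues: Iterable[Cue],
    transcribe: Callable[[Cue], str],  # hypothetical wrapper around a speech API
    threshold: float = 0.8,            # assumed cutoff, not taken from the paper
) -> List[Cue]:
    """Keep cues whose ASR hypothesis agrees with the subtitle text."""
    kept = []
    for cue in cues:
        hypothesis = transcribe(cue)
        if similarity(cue.text, hypothesis) >= threshold:
            kept.append(cue)
    return kept
```

In practice `transcribe` would cut the audio between `cue.start` and `cue.end` and send it to the speech service. Passing it in as a callback keeps the filtering logic testable without network access, and the "semi-automatic" part of the process would then be a human review of the cues that pass or narrowly fail the threshold.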
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
