Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations

Linrong Pan; Chenglong Jiang; Gaoze Hou; Ying Gao

arXiv:2505.05056·cs.CL·May 9, 2025

Open Access

TL;DR

This paper introduces Teochew-Wild, the first in-the-wild Teochew speech dataset with orthographic annotations, supporting research in speech recognition and synthesis for this low-resource dialect.

Contribution

It presents the first publicly available Teochew speech corpus with detailed orthographic and pinyin annotations, along with tools to facilitate speech technology research.

Findings

01 The corpus is effective for ASR tasks

02 The corpus supports TTS development

03 Experiments validate the dataset's usefulness

Abstract

This paper reports the construction of Teochew-Wild, a speech corpus of the Teochew dialect. The corpus includes 18.9 hours of in-the-wild Teochew speech data from multiple speakers, covering both formal and colloquial expressions, with precise orthographic and pinyin annotations. Additionally, we provide supplementary text processing tools and resources to propel research and applications in speech tasks for this low-resource language, such as automatic speech recognition (ASR) and text-to-speech (TTS). To the best of our knowledge, this is the first publicly available Teochew dataset with accurate orthographic annotations. We conduct experiments on the corpus, and the results validate its effectiveness in ASR and TTS tasks.

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Authorship Attribution and Profiling