TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

Kai-Wei Chang; Yi-Cheng Lin; Huang-Cheng Chou; Wenze Ren; Yu-Han Huang; Yun-Shao Tsai; Chien-Cheng Chen; Yu Tsao; Yuan-Fu Liao; Shrikanth Narayanan; James Glass; Hung-yi Lee

arXiv:2603.21478·cs.CL·March 24, 2026

TL;DR

TaigiSpeech is a real-world Taiwanese Taigi speech intent dataset aimed at low-resource language applications. The paper pairs it with two data mining strategies for scalable dataset creation and reports preliminary intent detection results.

Contribution

The paper presents the first real-world Taiwanese speech intent dataset and explores novel data mining methods with minimal supervision for low-resource language data collection.

Findings

01 Collection of about 3,000 utterances from 21 older-adult speakers

02 Two data mining strategies: keyword match mining with LLM pseudo labeling, and an audio-visual multimodal framework

03 Preliminary intent detection results that demonstrate feasibility

Abstract

Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction…
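
To make the keyword match strategy more concrete, the following is a minimal sketch of the candidate-selection step only, not the authors' pipeline. It assumes that each audio segment already comes with text in an intermediate language (for example, subtitles). The intent names, keyword lists, `Segment` type, and `mine_candidates` function are all hypothetical, and the LLM pseudo-labeling step that the abstract describes is left out.

```python
from dataclasses import dataclass

# Hypothetical intents and keywords, for illustration only.
# The paper's actual label set and keyword lists are not given in this listing.
INTENT_KEYWORDS = {
    "call_for_help": ["help", "emergency"],
    "turn_on_light": ["turn on the light", "lights on"],
}


@dataclass
class Segment:
    clip_id: str
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # text in an intermediate language (assumed to be available)


def mine_candidates(segments, intent_keywords=INTENT_KEYWORDS):
    """Return (segment, intent, keyword) triples for segments whose text hits a keyword."""
    hits = []
    for seg in segments:
        lowered = seg.text.lower()
        for intent, keywords in intent_keywords.items():
            for kw in keywords:
                if kw in lowered:
                    hits.append((seg, intent, kw))
                    break  # record at most one hit per intent per segment
    return hits


segments = [
    Segment("clip_a", 0.0, 2.5, "Please help me"),
    Segment("clip_a", 2.5, 5.0, "Nice weather today"),
    Segment("clip_b", 1.0, 3.0, "Lights on now"),
]
candidates = mine_candidates(segments)
for seg, intent, kw in candidates:
    print(seg.clip_id, seg.start, seg.end, intent, kw)
```

In a fuller pipeline, each candidate would then be passed to a labeling step (an LLM in the paper's description) that confirms or rejects the tentative intent before the audio is added to the training pool.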

Code & Models

Datasets

TaigiSpeech/TaigiSpeech
dataset · 1.7k dl

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · ICT in Developing Communities