Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges

Nguyen Van Dinh; Thanh Chi Dang; Luan Thanh Nguyen; Kiet Van Nguyen

arXiv:2410.03458 · cs.CL · October 7, 2024

TL;DR

This paper introduces ViMD, a Vietnamese multi-dialect speech dataset covering 63 provincial dialects, benchmarks speech recognition and dialect identification on it, and discusses the challenges and implications for low-resource language processing.

Contribution

The paper presents the first dataset of Vietnamese dialects at province-level granularity, along with baseline models for the dialect identification and speech recognition tasks.

Findings

01. Geographical factors influence dialect variations.

02. Current models face challenges with multi-dialect speech data.

03. The dataset enables future research in low-resource language dialects.

Abstract

Vietnamese, a low-resource language, is typically categorized into three primary dialect groups corresponding to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none of them provides a fine-grained classification of the 63 dialects specific to the individual provinces of Vietnam. To address this gap, we introduce the Vietnamese Multi-Dialect (ViMD) dataset, a novel, comprehensive dataset capturing the rich diversity of the 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio in approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained…

Code & Models

Repositories

nguyen-dv/ViMD_Dataset (official)

Videos

Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges · Underline

Taxonomy

Topics: Natural Language Processing Techniques