ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation
Khoa Anh Ta, Nguyen Van Dinh, Kiet Van Nguyen

TL;DR
This paper introduces ViDia2Std, a manually annotated parallel corpus for Vietnamese dialect-to-standard translation that covers all 63 provinces, and benchmarks translation models on it to show why dialect-aware NLP resources matter.
Contribution
The paper presents the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation that covers Northern, Central, and Southern varieties, and evaluates translation models on it as a step toward better dialect normalization.
Findings
ViDia2Std includes over 13,000 sentence pairs from all Vietnamese regions.
mBART-large-50 achieves the highest translation quality with BLEU 0.8166.
Dialect normalization significantly enhances downstream Vietnamese NLP tasks.
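The BLEU figure above is reported on a 0 to 1 scale (0.8166 corresponds to 81.66 on the 0 to 100 scale that some toolkits print). As a rough illustration of how such a score is computed, here is a minimal sketch of corpus-level BLEU with one reference per sentence. It assumes whitespace tokenization and no smoothing, which may differ from the paper's evaluation setup. The function names and the example sentences are hypothetical and are not taken from ViDia2Std.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1], one reference per hypothesis.

    Assumption: whitespace tokenization and no smoothing; this is an
    illustrative sketch, not the paper's evaluation script.
    """
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # n-grams in the system output, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each n-gram to its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Hypothetical system output and reference (not from the corpus).
system_output = ["hôm nay tôi đi chợ mua cá"]
reference = ["hôm nay tôi đi chợ mua rau"]
score = corpus_bleu(system_output, reference)
print(f"BLEU = {score:.4f}")
```

In this toy pair only the last of seven words differs, so the score stays well below 1: the single substitution also breaks every higher-order n-gram that ends on it.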
Abstract
Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
