Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien
Sin-En Lu, Bo-Han Lu, Chao-Yi Lu, Richard Tzong-Han Tsai

TL;DR
This paper presents a method for constructing a Hokkien-Mandarin code-mixed dataset, develops a linguistics-based toolkit for Hokkien word segmentation, and adapts a cross-lingual language model (XLM) to improve translation quality in dialect code-mixing scenarios.
Contribution
It introduces a novel dataset construction method, a linguistic toolkit for Hokkien segmentation, and an adapted XLM model for better dialect code-mixed translation.
Findings
Linguistic knowledge improves translation quality.
The dataset enables better code-mixed NLP research.
Adapted XLM performs well on dialect code-mixing tasks.
Abstract
In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and lack an official writing system, which limits the development of dialect CM research. In this paper, we propose a method to construct a Hokkien-Mandarin CM dataset that mitigates this limitation, overcomes the morphological issue within the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use our proposed dataset and employ transfer learning to train the XLM (cross-lingual language model) for translation…
Taxonomy
Topics: Natural Language Processing Techniques · Multilingual Education and Policy · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Dense Connections · Adam · Softmax · Layer Normalization · Linear Layer · Dropout · Multi-Head Attention · Residual Connection
