Sentence-level dialects identification in the greater China region
Fan Xu, Mingwen Wang, Maoxi Li

TL;DR
This paper presents a novel approach for identifying dialects of Mandarin Chinese within the Greater China Region, utilizing new word-level features to improve accuracy over traditional character or word uni-gram methods.
Contribution
The paper introduces new word-level features, including PMI-based and word alignment-based features, to enhance dialect identification accuracy in Mandarin Chinese.
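As a rough illustration of what a PMI-based word feature can look like, the sketch below scores adjacent word pairs by pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is a minimal sketch under stated assumptions, not the authors' feature definition: the function name `pmi_table`, the restriction to adjacent pairs, and the toy tokens are all hypothetical, and the input is assumed to be already word-segmented.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Return PMI scores for adjacent word pairs.

    `sentences` is a list of token lists (assumed pre-segmented).
    Hypothetical helper for illustration only.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        # Count each adjacent (left, right) word pair.
        pair_counts.update(zip(tokens, tokens[1:]))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = word_counts[x] / total_words
        p_y = word_counts[y] / total_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus with placeholder tokens (not real dialect data).
scores = pmi_table([["a", "b"], ["a", "b"], ["c", "d"]])
```

In a dialect classifier, scores like these could be computed per region and compared, so that word pairs strongly associated in one variety but not another become discriminative features. That usage is an assumption about how such a feature might be applied.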
Findings
Character-level and word-level uni-gram features alone are insufficient for dialect discrimination, because words in these varieties are ambiguous and context-dependent.
Word-level features, especially PMI-based and alignment-based, improve performance.
Evaluation on Wikipedia datasets confirms the effectiveness of the proposed method.
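For contrast with the word-level features, the general character-level n-gram features mentioned in the abstract can be sketched as a simple count over sliding windows. This is a minimal sketch, not the paper's pipeline: the function name `char_ngrams` and the default `n_max=3` are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(sentence, n_max=3):
    """Count all character n-grams of length 1..n_max in a sentence.

    Hypothetical helper; n_max=3 is an arbitrary illustrative choice.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        # Slide a window of width n across the sentence.
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return counts


features = char_ngrams("abab", n_max=2)
```

With `n_max=1` this reduces to the character uni-gram baseline that the findings describe as insufficient on its own.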
Abstract
Identifying different varieties of the same language is more challenging than identifying unrelated languages. In this paper, we propose an approach to discriminate language varieties or dialects of Mandarin Chinese for Mainland China, Hong Kong, Taiwan, Macao, Malaysia and Singapore, a.k.a. the Greater China Region (GCR). When applied to dialect identification in the GCR, we find that the commonly used character-level or word-level uni-gram features are not very effective, since there are several specific problems such as the ambiguity and context dependence of words in the dialects of the GCR. To overcome these challenges, we use not only general features like character-level n-grams, but also many new word-level features, including PMI-based and word alignment-based features. A series of evaluation results on both the news and open-domain dataset from…
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Linguistic Variation and Morphology
