CantoNLU: A benchmark for Cantonese natural language understanding
Junghyun Min, York Hay Ng, Sophia Chan, Helena Shunhua Zhao, En-Shiun Annie Lee

TL;DR
This paper introduces CantoNLU, a comprehensive benchmark for Cantonese NLP tasks, providing datasets, baseline models, and analysis to advance research in this under-resourced language.
Contribution
It presents the first multi-task Cantonese NLU benchmark, along with baseline models and analysis, to support future NLP research in Cantonese.
Findings
Cantonese-adapted models outperform other models overall.
Monolingual Cantonese models excel at syntactic tasks.
Mandarin models remain competitive when Cantonese data is limited.
Abstract
Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address the scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics: word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide baseline performance for a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continually pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
