CantoNLU: A benchmark for Cantonese natural language understanding

Junghyun Min; York Hay Ng; Sophia Chan; Helena Shunhua Zhao; En-Shiun Annie Lee

arXiv:2510.20670 · cs.CL · October 24, 2025

Open Access

TL;DR

This paper introduces CantoNLU, a seven-task benchmark for Cantonese natural language understanding, providing datasets, baseline models, and analysis to advance research on this under-resourced language.

Contribution

It presents the first multi-task Cantonese NLU benchmark, along with baseline models and analysis, to support future NLP research in Cantonese.

Findings

01. Cantonese-adapted models outperform other models overall.

02. Monolingual Cantonese models excel at syntactic tasks.

03. Mandarin models remain competitive when Cantonese data is limited.

Abstract

Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address the scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics: word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide baseline performance for a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continually pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual…
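The evaluation setup described in the abstract (seven tasks, four baselines) can be pictured as a grid of model and task pairs. The following is a minimal sketch of that grid only, not the authors' evaluation code: the task names come from the abstract, while the baseline labels and the `evaluation_grid` helper are hypothetical names chosen for illustration.

```python
from itertools import product

# The seven tasks named in the abstract.
TASKS = [
    "word sense disambiguation",
    "linguistic acceptability judgment",
    "language detection",
    "natural language inference",
    "sentiment analysis",
    "part-of-speech tagging",
    "dependency parsing",
]

# The four baselines described in the abstract.
# These labels are hypothetical; the abstract does not name the models.
BASELINES = [
    "mandarin",               # Mandarin model without Cantonese training
    "cantonese_adapted_1",    # Mandarin model continually pre-trained on Cantonese text
    "cantonese_adapted_2",    # second Cantonese-adapted model
    "cantonese_monolingual",  # Cantonese model trained from scratch
]


def evaluation_grid(tasks=TASKS, baselines=BASELINES):
    """Return every (baseline, task) pair a full benchmark run would cover."""
    return list(product(baselines, tasks))


grid = evaluation_grid()
print(len(grid))  # 4 baselines x 7 tasks = 28 pairs
```

Each pair would correspond to one fine-tuning or evaluation run; how scores are aggregated into an "overall" ranking is not stated in the abstract and is left out here.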

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification