Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong

Hei Yi Mak; Tan Lee

arXiv:2505.17816·cs.CL·May 26, 2025

TL;DR

This paper develops a transformer-based neural machine translation (NMT) system for translating standard written Chinese into written Cantonese. It addresses data scarcity by mining additional parallel sentences from Wikipedia, and reports higher BLEU scores than Baidu Fanyi on most test sets.

Contribution

It introduces an effective method for augmenting training data for low-resource Chinese-Cantonese translation using Wikipedia sentence mining.

Findings

01

The proposed system outperforms Baidu Fanyi in BLEU score on most test sets.

02

Mining Wikipedia sentences significantly improves translation quality.

03

The system captures linguistic differences between Chinese and Cantonese.
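
The findings above compare systems by BLEU score. As a rough illustration of what that metric measures, here is a minimal sketch of corpus-level BLEU in plain Python. It is not the paper's evaluation setup: the tokenization and scoring tool are not stated on this page, so character-level tokens, a single reference per sentence, and the names `ngrams` and `corpus_bleu` are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis.

    Assumption: each sentence is split into single characters, a common
    choice for Chinese-script text; the paper may tokenize differently.
    """
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = list(hyp), list(ref)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, uniform weights.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)
```

Calling `corpus_bleu(system_outputs, reference_translations)` with two equal-length lists of strings returns one score for the whole test set. Published comparisons normally rely on a standard tool with a fixed tokenizer, so scores from this sketch are not comparable with the paper's numbers.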

Abstract

The majority of inhabitants in Hong Kong are able to read and write in standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is on the effort of preparing a good amount of training…
