CCAE: A Corpus of Chinese-based Asian Englishes
Yang Liu, Melissa Xiaohui Qin, Long Wang, and Chao Huang

TL;DR
This paper introduces CCAE, a large-scale corpus of Chinese-based Asian English varieties, enabling NLP research on World Englishes and Chinese Englishes, with preliminary experiments demonstrating its practical utility.
Contribution
The paper creates the first publicly accessible multi-variety corpus of Chinese-based Asian Englishes, facilitating NLP studies of language variation and downstream applications.
Findings
CCAE contains 340 million tokens in 448 thousand web documents from six regions.
Preliminary experiments show the corpus's usefulness for language modeling.
The corpus supports research on Asian Englishes and Chinese Englishes.
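As a rough sense of scale, the figures quoted in the abstract (340 million tokens, 448 thousand web documents) imply an average document length. The sketch below only does that arithmetic; the constant and function names are illustrative, and the inputs are the rounded figures as reported, so the result is an approximation rather than a statistic from the paper.

```python
# Rounded figures as quoted in the abstract; the exact counts may differ.
TOTAL_TOKENS = 340_000_000   # "340 million tokens"
TOTAL_DOCUMENTS = 448_000    # "448 thousand web documents"


def mean_tokens_per_document(tokens: int, documents: int) -> float:
    """Return the average number of tokens per document."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return tokens / documents


if __name__ == "__main__":
    avg = mean_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
    # Roughly 759 tokens per web document, given the rounded inputs.
    print(f"about {avg:.0f} tokens per document")
```

This says nothing about how tokens are distributed across the six varieties, which the summary here does not report.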
Abstract
Language models have become a foundation for many NLP applications, but they have not been well applied in language variety studies, even for a language as widely used as English. This paper represents one of the few initial efforts to apply NLP technology within the World Englishes paradigm, specifically by creating a multi-variety corpus for studying Asian Englishes. We present an overview of CCAE (Corpus of Chinese-based Asian English), a suite of corpora comprising six Chinese-based Asian English varieties. It is based on 340 million tokens in 448 thousand web documents from six regions. The ontology of the data makes the corpus a helpful resource with enormous research potential for Asian Englishes (especially for Chinese Englishes, for which no publicly accessible corpus has existed so far) and an ideal source for variety-specific language modeling…
Taxonomy
Methods: Ontology
