CCAE: A Corpus of Chinese-based Asian Englishes

Yang Liu, Melissa Xiaohui Qin, Long Wang, and Chao Huang

arXiv:2310.05381 · cs.CL · October 10, 2023

TL;DR

This paper introduces CCAE, a large-scale corpus of Chinese-based Asian English varieties, enabling NLP research on World Englishes and Chinese Englishes, with preliminary experiments demonstrating its practical utility.

Contribution

It creates the first publicly accessible multi-variety corpus of Chinese-based Asian Englishes, facilitating NLP studies in language variation and downstream applications.

Findings

01. CCAE contains 340 million tokens from six regions.

02. Preliminary experiments show the corpus's usefulness for language modeling.

03. The corpus supports research on Asian Englishes and Chinese Englishes.

Abstract

Language models have become foundations for many NLP applications, but they have not been well applied in language variety studies, even for a language as widely used as English. This paper represents one of the few initial efforts to apply NLP technology within the World Englishes paradigm, specifically by creating a multi-variety corpus for studying Asian Englishes. We present an overview of CCAE (Corpus of Chinese-based Asian English), a suite of corpora comprising six Chinese-based Asian English varieties. It is based on 340 million tokens in 448 thousand web documents from six regions. The ontology of the data makes the corpus a helpful resource with enormous research potential for Asian Englishes (especially for Chinese Englishes, for which there has so far been no publicly accessible corpus) and an ideal source for variety-specific language modeling…
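The totals quoted in the abstract imply a rough average document length. The following is a minimal sketch of that arithmetic, assuming the rounded figures (340 million tokens, 448 thousand documents) can be treated as exact; the function name is illustrative and does not come from the paper.

```python
def mean_tokens_per_document(total_tokens: int, total_documents: int) -> float:
    """Return the average number of tokens per document."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return total_tokens / total_documents


# Rounded totals quoted in the abstract (assumed exact for illustration only).
TOTAL_TOKENS = 340_000_000
TOTAL_DOCUMENTS = 448_000

average = mean_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
print(f"{average:.0f} tokens per document on average")
```

This is a corpus-wide mean only. The abstract gives no per-region or per-variety counts, so the figure says nothing about how document length varies across the six regions.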

Code & Models

Datasets

CCAE/CCAE-Corpus
dataset · 32 dl

Taxonomy

Methods: Ontology