Cross-corpus Readability Compatibility Assessment for English Texts
Zhenzhen Li, Han Ding, Shaohong Zhang

TL;DR
This paper introduces a new framework for assessing the compatibility of different corpora in readability assessment, utilizing multiple datasets, features, and models to ensure robustness and facilitate cross-corpus transfer learning.
Contribution
It proposes the Cross-corpus text Readability Compatibility Assessment (CRCA) framework, combining linguistic features, word vectors, and various models with new compatibility metrics to evaluate corpus similarity.
Findings
The OSP corpus differs significantly from the other corpora.
Compatibility varies with features and models, showing an adaptation effect.
The three compatibility metrics (RJSD, RRNSS, NDCG) give consistent results, supporting the framework's robustness.
Abstract
Text readability assessment has gained significant attention from researchers in various domains. However, the lack of exploration into corpus compatibility poses a challenge as different research groups utilize different corpora. In this study, we propose a novel evaluation framework, Cross-corpus text Readability Compatibility Assessment (CRCA), to address this issue. The framework encompasses three key components: (1) Corpus: CEFR, CLEC, CLOTH, NES, OSP, and RACE. Linguistic features, GloVe word vector representations, and their fusion features were extracted. (2) Classification models: Machine learning methods (XGBoost, SVM) and deep learning methods (BiLSTM, Attention-BiLSTM) were employed. (3) Compatibility metrics: RJSD, RRNSS, and NDCG metrics. Our findings revealed: (1) Validated corpus compatibility, with OSP standing out as significantly different from other datasets. (2) An…
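The abstract names RJSD, RRNSS, and NDCG as compatibility metrics but does not define them here. As a rough illustration of the kind of computation such metrics build on, the sketch below implements the plain Jensen-Shannon divergence between two label distributions and the standard NDCG of a ranked list. This is an assumption-laden sketch, not the paper's formulas: RJSD and RRNSS may differ from these textbook definitions, and the function names and sample numbers are hypothetical.

```python
import math


def js_divergence(p, q):
    """Textbook Jensen-Shannon divergence (base 2) between two discrete
    distributions. NOT necessarily the paper's RJSD; shown for intuition only."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]  # normalise so each sums to 1
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ndcg(relevances):
    """Standard NDCG with linear gain and log2 discount for a ranked list of
    relevance scores (first element = top-ranked item)."""
    def dcg(scores):
        return sum(r / math.log2(i + 2) for i, r in enumerate(scores))

    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: readability-level distributions predicted on two
# corpora by the same model (made-up numbers, not results from the paper).
corpus_a_levels = [0.10, 0.40, 0.50]
corpus_b_levels = [0.30, 0.40, 0.30]
print(js_divergence(corpus_a_levels, corpus_b_levels))

# Hypothetical ranking of corpora by similarity, scored against graded relevance.
print(ndcg([3, 1, 2]))
```

With base-2 logarithms the divergence lies in [0, 1], where 0 means identical distributions and 1 means no overlap; NDCG equals 1 when the ranking matches the ideal order.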
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques
Methods: GloVe Embeddings
