Predicting Lexical Complexity in English Texts: The Complex 2.0 Dataset
Matthew Shardlow, Richard Evans, Marcos Zampieri

TL;DR
This paper introduces the CompLex 2.0 dataset for lexical complexity prediction in English texts, demonstrating that Likert-scale annotations improve the identification of complex words and supporting advancements in text simplification.
Contribution
The paper develops a new annotation protocol and dataset for lexical complexity, enhancing the accuracy of complexity prediction models.
Findings
Likert-scale annotations outperform binary labels in identifying complex words.
The new dataset facilitates better training of lexical complexity prediction systems.
Analysis of datasets reveals properties influencing complexity classification.
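As a rough illustration of the contrast between Likert-scale and binary labels, the minimal Python sketch below aggregates several annotators' ratings for one word in two ways. The 5-point scale, the mapping onto [0, 1], the "any annotator" binary rule, and all names are assumptions made for this example; they are not taken from the paper's protocol.

```python
from statistics import mean

# Assumed mapping of a 5-point Likert rating (1 = very easy ... 5 = very difficult)
# onto the interval [0, 1]; chosen for illustration only.
LIKERT_TO_SCORE = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}


def continuous_complexity(ratings):
    """Average the mapped Likert ratings into a single score in [0, 1]."""
    return mean(LIKERT_TO_SCORE[r] for r in ratings)


def binary_complexity(ratings, threshold=3):
    """Hypothetical binary rule: complex if any annotator rates at or above threshold."""
    return any(r >= threshold for r in ratings)


# Two hypothetical words with different annotator ratings.
ratings_a = [1, 2, 2, 3]
ratings_b = [3, 4, 5, 5]

for name, ratings in (("word_a", ratings_a), ("word_b", ratings_b)):
    print(name, continuous_complexity(ratings), binary_complexity(ratings))
```

Under these assumptions both words receive the same binary label, while the averaged score keeps them apart, which is the kind of graded information a continuous annotation can preserve.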
Abstract
Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as Complex Word Identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required. In this paper we analyze previous work carried out in this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use this to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides an…
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Authorship Attribution and Profiling
