CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data
Matthew Shardlow, Michael Cooper, Marcos Zampieri

TL;DR
This paper introduces CompLex, the first English corpus for continuous lexical complexity prediction. Words are annotated on a 5-point Likert scale rather than with a binary complex/non-complex label, in texts drawn from three domains.
Contribution
It presents a new dataset with Likert scale annotations for lexical complexity, addressing a limitation of earlier Complex Word Identification (CWI) datasets, which were annotated with a binary scheme.
Findings
First continuous lexical complexity dataset in English
Annotations cover three sources/domains: the Bible, Europarl, and biomedical texts
Provides a foundation for more nuanced NLP complexity prediction models
Abstract
Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studies have approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using a binary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexity prediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts. This resulted in a corpus of 9,476 sentences each annotated by around 7…
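As a rough illustration of how several 5-point Likert ratings for one target word could be turned into a single continuous complexity value, the sketch below rescales each rating linearly to the range 0 to 1 and averages across annotators. The linear rescaling, the function names, the sample ratings, and the 0.5 cut-off are assumptions made for illustration; the abstract above does not specify the aggregation procedure.

```python
from statistics import mean


def likert_to_unit(rating: int, points: int = 5) -> float:
    """Map a Likert rating in 1..points linearly onto [0, 1] (assumed scheme)."""
    if not 1 <= rating <= points:
        raise ValueError(f"rating must be in 1..{points}, got {rating}")
    return (rating - 1) / (points - 1)


def complexity_score(ratings: list[int], points: int = 5) -> float:
    """Average the rescaled ratings from all annotators for one target word."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(likert_to_unit(r, points) for r in ratings)


# Hypothetical ratings from seven annotators for a single word in context.
ratings = [1, 2, 2, 3, 1, 2, 4]
score = complexity_score(ratings)

# A binary CWI-style label can still be recovered by thresholding;
# the 0.5 cut-off is an arbitrary choice for this example.
is_complex = score >= 0.5
print(f"continuous score: {score:.3f}, binary label: {is_complex}")
```

A continuous score of this kind keeps the graded disagreement between annotators that a binary label would discard, while still allowing a binary label to be derived when a downstream system needs one.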
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling
