Complex Word Identification: Challenges in Data Annotation and System Performance
Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, Lucia Specia

TL;DR
This paper examines the challenges in complex word identification, highlighting issues with data annotation and system performance, and analyzing factors that make lexical complexity difficult to classify.
Contribution
It investigates the impact of annotation methods on CWI system performance using ensemble classifiers and provides insights into the challenges of lexical complexity detection.
Findings
Most systems performed poorly on the SemEval CWI dataset
Annotation methods significantly affect classification accuracy
Understanding lexical complexity remains a challenge
Abstract
This paper revisits the problem of complex word identification (CWI), following up on the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and one of the reasons for that is the way in which human annotation was performed.
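As a rough illustration of what an ensemble over CWI systems can look like, the sketch below combines the binary outputs of several systems by majority vote. The abstract does not say which combination rule the authors used, so majority voting is an assumption here, and the function names, the tie-breaking choice, and the toy labels are all hypothetical and not taken from the paper or the SemEval data.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-system binary labels word by word.

    `predictions` is a list of equal-length label lists, one per system,
    where 1 means "complex" and 0 means "non-complex".
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        # Assumption: ties are broken toward "non-complex" (0).
        combined.append(1 if counts[1] > counts[0] else 0)
    return combined


def accuracy(predicted, gold):
    """Fraction of words whose predicted label matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Made-up toy labels for five words; not real system outputs.
gold = [1, 0, 0, 1, 0]
system_outputs = [
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
]

ensemble = majority_vote(system_outputs)
for i, output in enumerate(system_outputs, start=1):
    print(f"system {i} accuracy: {accuracy(output, gold):.2f}")
print(f"ensemble accuracy: {accuracy(ensemble, gold):.2f}")
```

In this toy setup the individual systems make errors on different words, so the vote recovers the gold labels; when systems share the same errors, for instance because the annotation itself is inconsistent, voting cannot correct them.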
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling
