Complex Word Identification: Challenges in Data Annotation and System Performance

Marcos Zampieri; Shervin Malmasi; Gustavo Paetzold; Lucia Specia

arXiv:1710.04989 · cs.CL · October 16, 2017 · 6 cites

Open Access

TL;DR

This paper examines the challenges in complex word identification, highlighting issues with data annotation and system performance, and analyzing factors that make lexical complexity difficult to classify.

Contribution

The paper uses ensemble classifiers to investigate how annotation methods affect CWI system performance, and it offers insight into why lexical complexity is hard to detect.

Findings

01. Most systems performed poorly on the SemEval CWI dataset.

02. The way human annotation was performed is one reason for the poor classification performance.

03. Understanding what makes lexical complexity hard to classify remains a challenge.

Abstract

This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and one of the reasons for that is the way in which human annotation was performed.
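To make the idea of an ensemble classifier for CWI concrete, here is a minimal sketch of majority voting over three simple word-level cues. It is an illustration only and not the authors' system: the voter functions, the thresholds, and the tiny `COMMON_WORDS` lexicon are all hypothetical assumptions, and the abstract does not say which features or learners the paper actually uses.

```python
import re

# Hypothetical toy lexicon of frequent words (an assumption for this sketch);
# a real system would rely on corpus frequency counts instead.
COMMON_WORDS = {"the", "cat", "house", "water", "people", "because", "small"}


def vote_length(word, threshold=8):
    """Vote 'complex' when the word has at least `threshold` characters."""
    return len(word) >= threshold


def vote_syllables(word, threshold=3):
    """Vote 'complex' when the count of vowel groups (a rough syllable proxy)
    reaches `threshold`."""
    return len(re.findall(r"[aeiouy]+", word.lower())) >= threshold


def vote_rarity(word):
    """Vote 'complex' when the word is missing from the toy frequent-word list."""
    return word.lower() not in COMMON_WORDS


# The ensemble: each base voter casts one binary vote.
VOTERS = (vote_length, vote_syllables, vote_rarity)


def is_complex(word):
    """Label a word as complex when a strict majority of voters say so."""
    votes = sum(1 for voter in VOTERS if voter(word))
    return votes * 2 > len(VOTERS)


if __name__ == "__main__":
    for w in ("cat", "because", "ubiquitous"):
        print(w, is_complex(w))
```

A single cue can misfire (for example, "because" has three vowel groups but is a very common word), so combining several weak voters smooths out individual errors. The labels are only as reliable as the annotation they are compared against, which is the concern the abstract raises.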

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling