Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure
Oded Avraham, Yoav Goldberg

TL;DR
This paper introduces a new approach for creating more reliable word similarity evaluation datasets by redesigning annotation tasks and developing a performance measure that accounts for annotation reliability.
Contribution
It proposes a novel annotation task design and a performance measure that together enhance the reliability of word similarity evaluations.
Findings
The redesigned annotation task achieves higher inter-rater agreement
The performance measure takes the reliability of each annotation decision into account
Together, these make word similarity evaluation more reliable
Abstract
We suggest a new method for creating and using gold-standard datasets for word similarity evaluation. Our goal is to improve the reliability of the evaluation, and we do this by redesigning the annotation task to achieve higher inter-rater agreement, and by defining a performance measure which takes the reliability of each annotation decision in the dataset into account.
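The abstract does not spell out the performance measure, so the following is only a minimal sketch of one way a reliability-aware score could work, not the authors' definition. It assumes each gold annotation is a comparison (for a target word, one candidate was judged more similar than another) carrying a reliability weight, for example derived from annotator agreement. The names `Comparison`, `reliability_weighted_accuracy`, and the `similarity` callback are hypothetical and introduced purely for illustration.

```python
from typing import Callable, NamedTuple, Sequence


class Comparison(NamedTuple):
    """One gold annotation decision (hypothetical structure, for illustration)."""
    target: str         # word the candidates are compared against
    preferred: str      # candidate annotators judged MORE similar to target
    other: str          # candidate annotators judged LESS similar to target
    reliability: float  # assumed weight >= 0, e.g. from annotator agreement


def reliability_weighted_accuracy(
    comparisons: Sequence[Comparison],
    similarity: Callable[[str, str], float],
) -> float:
    """Share of reliability mass on which the model agrees with the gold decision.

    A decision counts as correct when the model scores the preferred candidate
    strictly higher than the other one. Low-reliability decisions contribute
    little to the score, so noisy annotations cannot dominate it.
    """
    total = sum(c.reliability for c in comparisons)
    if total <= 0:
        raise ValueError("total reliability must be positive")
    agreed = sum(
        c.reliability
        for c in comparisons
        if similarity(c.target, c.preferred) > similarity(c.target, c.other)
    )
    return agreed / total


# Toy model: a lookup table of made-up similarity scores.
toy_scores = {
    ("cat", "dog"): 0.8,
    ("cat", "car"): 0.1,
    ("cat", "kitten"): 0.3,
    ("cat", "animal"): 0.5,
}


def toy_similarity(a: str, b: str) -> float:
    return toy_scores.get((a, b), 0.0)


gold = [
    Comparison("cat", "dog", "car", reliability=3.0),        # model agrees
    Comparison("cat", "kitten", "animal", reliability=1.0),  # model disagrees
]
score = reliability_weighted_accuracy(gold, toy_similarity)
```

In this toy setup the model agrees with the high-reliability decision and disagrees with the low-reliability one, so the weighted score is higher than the unweighted accuracy of one half. Any real use would substitute the reliability estimates and comparison format defined in the paper itself.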
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
