Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure

Oded Avraham; Yoav Goldberg

arXiv:1611.03641·cs.CL·February 28, 2017·5 cites

TL;DR

This paper introduces a new approach for creating more reliable word similarity evaluation datasets by redesigning annotation tasks and developing a performance measure that accounts for annotation reliability.

Contribution

It proposes a novel annotation task design and a performance measure that together enhance the reliability of word similarity evaluations.

Findings

01 Higher inter-rater agreement achieved

02 Performance measure accounts for annotation reliability

03 Improved evaluation consistency

Abstract

We suggest a new method for creating and using gold-standard datasets for word similarity evaluation. Our goal is to improve the reliability of the evaluation, and we do this by redesigning the annotation task to achieve higher inter-rater agreement, and by defining a performance measure which takes the reliability of each annotation decision in the dataset into account.
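The idea of weighting each annotation decision by its reliability can be illustrated with a minimal sketch. This is not the authors' exact measure, which the abstract does not specify. It assumes, hypothetically, that each gold-standard decision is a preference of the form "`target` is more similar to `closer` than to `farther`" and carries a reliability weight in [0, 1], for example one derived from rater agreement. The names `Decision`, `reliability`, and `weighted_score` are illustrative only.

```python
from typing import Callable, NamedTuple, Sequence


class Decision(NamedTuple):
    """Hypothetical gold decision: `target` is more similar to `closer` than to `farther`."""
    target: str
    closer: str
    farther: str
    reliability: float  # assumed weight in [0, 1], e.g. derived from rater agreement


def weighted_score(
    decisions: Sequence[Decision],
    similarity: Callable[[str, str], float],
) -> float:
    """Reliability-weighted fraction of gold decisions that the model reproduces."""
    total = sum(d.reliability for d in decisions)
    if total == 0:
        return 0.0  # no reliable decisions to score against
    correct = sum(
        d.reliability
        for d in decisions
        if similarity(d.target, d.closer) > similarity(d.target, d.farther)
    )
    return correct / total
```

Under this sketch, a model that disagrees with a low-reliability decision loses less than one that disagrees with a decision the annotators agreed on strongly, which is the intuition behind a reliability-aware measure.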

Code & Models

Repositories

oavraham1/ag-evaluation
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies