Evaluating semantic models with word-sentence relatedness
Kimberly Glasgow, Matthew Roos, Amy Haufler, Mark Chevillet, Michael Wolmetz

TL;DR
This paper introduces a dataset of human-annotated semantic relatedness for word-sentence pairs and evaluates how well various semantic models align with human judgments, highlighting their strengths and limitations.
Contribution
It provides a new dataset for evaluating semantic models with human judgments and compares multiple models' performance against this benchmark.
Findings
Some STS models captured much of the variance in human judgments
Models lacked sensitivity to implicatures and entailments
Data and stimuli are publicly available
Abstract
Semantic textual similarity (STS) systems are designed to encode and evaluate the semantic similarity between words, phrases, sentences, and documents. One method for assessing the quality or authenticity of semantic information encoded in these systems is by comparison with human judgments. A data set for evaluating semantic models was developed consisting of 775 English word-sentence pairs, each annotated for semantic relatedness by human raters engaged in a Maximum Difference Scaling (MDS) task, as well as a faster alternative task. As a sample application of this relatedness data, behavior-based relatedness was compared to the relatedness computed via four off-the-shelf STS models: n-gram, Latent Semantic Analysis (LSA), Word2Vec, and UMBC Ebiquity. Some STS models captured much of the variance in the human judgments collected, but they were not sensitive to the implicatures and…
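The comparison the abstract describes, between behavior-based relatedness and model-computed relatedness, can be sketched as a correlation over paired scores. The excerpt does not say which statistic the authors used, so the Pearson correlation and its square (the share of variance explained) are assumptions here, and the function names and scores below are hypothetical illustrations, not values from the data set.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def variance_explained(human, model):
    """Share of variance in human scores linearly accounted for by the model (r squared)."""
    return pearson_r(human, model) ** 2


# Hypothetical scores for four word-sentence pairs (made up for illustration).
human_scores = [0.9, 0.7, 0.4, 0.1]   # e.g. relatedness derived from rater choices
model_scores = [0.8, 0.6, 0.5, 0.2]   # e.g. cosine similarity from an STS model

if __name__ == "__main__":
    print(f"r   = {pearson_r(human_scores, model_scores):.3f}")
    print(f"r^2 = {variance_explained(human_scores, model_scores):.3f}")
```

A rank-based statistic such as Spearman's correlation may be a better fit when the human scores are ordinal. The same structure applies, with the scores converted to ranks first.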
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
