Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy
Jay J. Jiang (University of Waterloo), David W. Conrath (McMaster University)

TL;DR
This paper introduces a semantic similarity measure that integrates a lexical taxonomy with corpus statistics, improving correlation with human judgments of word similarity.
Contribution
It combines an edge-based method (edge counting over the taxonomy) with a node-based method (information content estimated from corpus data), so that semantic distance between taxonomy nodes is quantified with corpus evidence.
Findings
Achieves the highest correlation (r = 0.828) with human similarity ratings among the models compared
Outperforms other computational models on the same data set
Approaches the upper bound of human consistency (r = 0.885)
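To make the reported r values concrete, here is a minimal sketch of how such a correlation could be computed. It assumes r denotes Pearson's product-moment correlation. The function name pearson_r and the two score lists are hypothetical illustrations, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for four word pairs (NOT taken from the paper).
human_ratings = [3.9, 3.1, 1.2, 0.4]
model_scores = [0.95, 0.70, 0.35, 0.05]

r = pearson_r(human_ratings, model_scores)
print(f"r = {r:.3f}")
```

If a model outputs a distance rather than a similarity, its raw correlation with similarity ratings would be negative, so the scores would need to be converted or the sign interpreted accordingly.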
Abstract
This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content calculation. When tested on a common data set of word pair similarity ratings, the proposed approach outperforms other computational models. It gives the highest correlation value (r = 0.828) with a benchmark based on human similarity judgements, whereas an upper bound (r = 0.885) is observed when human…
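The combination described in the abstract can be sketched in code. The example uses the simplified form in which this measure is usually cited, dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs), where IC(c) = -log p(c) and lcs is the lowest common subsumer of the two concepts in the taxonomy. The taxonomy, the counts, and all function names below are hypothetical. Any further link weighting used in the full paper is not reproduced here.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and made-up corpus counts.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "fruit": "entity",
    "apple": "fruit",
}
COUNTS = {"entity": 0, "vehicle": 10, "car": 60, "bicycle": 20, "fruit": 5, "apple": 5}


def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def cumulative_count(concept):
    """Occurrences of the concept plus those of every concept it subsumes."""
    return sum(n for c, n in COUNTS.items() if concept in ancestors(c))


TOTAL = cumulative_count("entity")  # the root subsumes everything


def information_content(concept):
    """Node-based part: IC(c) = -log p(c), p(c) from cumulative counts."""
    return -math.log(cumulative_count(concept) / TOTAL)


def lowest_common_subsumer(c1, c2):
    """Edge-based part: first ancestor of c1 that also subsumes c2."""
    other = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in other)


def jc_distance(c1, c2):
    """Simplified distance: IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2))."""
    lcs = lowest_common_subsumer(c1, c2)
    return information_content(c1) + information_content(c2) - 2 * information_content(lcs)


print(jc_distance("car", "bicycle"))  # siblings under "vehicle"
print(jc_distance("car", "apple"))    # related only through the root
```

In this toy setup the two vehicle concepts come out closer to each other than "car" is to "apple", because their common subsumer "vehicle" is more specific than the root, which carries no information.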
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
