Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy

Jay J. Jiang (University of Waterloo); David W. Conrath (McMaster University)

arXiv:cmp-lg/9709008 · cmp-lg · February 3, 2008 · 2.2k cites

TL;DR

This paper introduces a novel semantic similarity measure that integrates lexical taxonomy with corpus statistics, improving correlation with human judgments of word similarity.

Contribution

It combines edge-based and node-based methods to enhance semantic distance measurement using corpus data and taxonomy structure.

Findings

01 Achieves the highest correlation (r = 0.828) with human similarity ratings

02 Outperforms other computational models on the same dataset

03 Approaches the upper bound of human consistency (r = 0.885)

Abstract

This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content calculation. When tested on a common data set of word pair similarity ratings, the proposed approach outperforms other computational models. It gives the highest correlation value (r = 0.828) with a benchmark based on human similarity judgements, whereas an upper bound (r = 0.885) is observed when human…
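To make the combined approach concrete, here is a minimal sketch of the distance in the simplified form in which this measure is commonly cited: dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2)), where IC(c) = -log p(c) and lcs is the lowest common subsumer in the taxonomy. The abstract above does not state the formula, so treat this form as an assumption. The toy taxonomy, the corpus counts, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy taxonomy (child -> parent); the root has parent None.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "fruit": "entity",
    "apple": "fruit",
}

# Hypothetical corpus counts for the words attached directly to each concept.
COUNTS = {"entity": 0, "vehicle": 10, "car": 60, "bicycle": 30, "fruit": 20, "apple": 80}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def information_content(concept):
    """IC(c) = -log p(c); p(c) counts the concept and everything below it."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return -math.log(subsumed / total)


def lowest_common_subsumer(c1, c2):
    """Return the first ancestor of c1 that also subsumes c2."""
    up2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in up2)


def jc_distance(c1, c2):
    """Simplified distance: IC(c1) + IC(c2) - 2 * IC(lcs) (assumed form)."""
    lcs = lowest_common_subsumer(c1, c2)
    return (information_content(c1) + information_content(c2)
            - 2 * information_content(lcs))


if __name__ == "__main__":
    print(jc_distance("car", "bicycle"))  # siblings under "vehicle"
    print(jc_distance("car", "apple"))    # related only through the root
```

With these made-up counts, "car" comes out closer to "bicycle" than to "apple", because the pair shares a more informative subsumer ("vehicle") than the root. The taxonomy supplies the path structure (the edge-based part) and the corpus counts supply the weights (the node-based part).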

Code & Models

Repositories

usc-isi-i2/kgtk-similarity

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies