The Google Similarity Distance
Rudi Cilibrasi (CWI), Paul M. B. Vitanyi (CWI, University of Amsterdam)

TL;DR
This paper introduces the Google similarity distance, a new method for measuring semantic similarity between words and phrases using web page counts from Google, enabling applications like clustering, classification, and translation.
Contribution
The paper presents a novel similarity measure based on information distance and Kolmogorov complexity, applied to web data, with demonstrated effectiveness in various NLP tasks.
Findings
Achieved 87% agreement with WordNet categories in classification tasks.
Successfully distinguished colors, numbers, and artist names.
Demonstrated basic automatic language translation.
Abstract
Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of 'society' is 'database,' and the equivalent of 'use' is 'way to search the database.' We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality.…
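The abstract describes deriving a distance from page counts but does not state the formula. The sketch below assumes the normalized form given in the full paper, NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the total number of indexed pages. The function name, parameter names, and the page counts in the usage lines are illustrative assumptions, not values from the paper, and no search engine is queried.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Sketch of the normalized distance computed from page counts.

    fx, fy : page counts for each term alone (hypothetical inputs)
    fxy    : page count for pages containing both terms
    n      : assumed total number of indexed pages, larger than fx and fy
    """
    # If either term, or the pair, never occurs, treat the distance as infinite.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Made-up counts, chosen only to exercise the formula.
same = ngd(1000, 1000, 1000, 10**6)   # terms that always co-occur
partial = ngd(1000, 100, 10, 10**6)   # terms that co-occur on a few pages
never = ngd(1000, 100, 0, 10**6)      # terms that never co-occur
```

Terms that always appear together get a distance of 0, and the value grows as co-occurrence becomes rarer relative to the individual counts. The ratio does not depend on the logarithm base, so natural logarithms are used here for convenience.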
Taxonomy
Topics: Computability, Logic, AI Algorithms · Algorithms and Data Compression · Advanced Text Analysis Techniques
