Similarity of Objects and the Meaning of Words
Rudi Cilibrasi (CWI), Paul Vitanyi (CWI, University of Amsterdam)

TL;DR
This paper introduces universal, compression-based and web-based similarity measures for objects and their names, demonstrating their effectiveness in data mining and semantic analysis through large-scale experiments.
Contribution
It presents novel universal similarity distances for both literal objects and object names, unifying various measures using compression and web data.
Findings
Universal distance based on compression effectively measures similarity between literal objects.
Web-based similarity using Google page counts correlates with semantic relations.
Large-scale experiments support the viability of both approaches.
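The summary above does not spell out the distances themselves. As a rough illustration, the sketch below uses the forms commonly cited in the literature on this topic: a normalized compression distance for literal objects and a normalized distance computed from page counts for names. Python's zlib stands in for the compressor, and the function names, page counts, and index size `n` are hypothetical values chosen for illustration, not details taken from this page.

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a stand-in for a real-world compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two literal objects.

    Assumed form: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized distance between two names, computed from page counts.

    fx, fy: number of pages containing each term (must be > 0)
    fxy:    number of pages containing both terms (must be > 0)
    n:      assumed total number of indexed pages (hypothetical here)
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 50
    b = b"lorem ipsum dolor sit amet consectetur adipiscing " * 50
    print("NCD(a, a):", ncd(a, a))  # near 0: the second copy adds little information
    print("NCD(a, b):", ncd(a, b))  # larger: the texts share little structure
    # Hypothetical page counts, for illustration only.
    print("NGD:", ngd(fx=1000, fy=1000, fxy=10, n=1_000_000))
```

With a real compressor the distance between an object and itself is small but not exactly zero, because compressing the concatenation still costs a few extra bytes. Terms that always co-occur give a page-count distance of zero, and the value grows as co-occurrence becomes rarer.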
Abstract
We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like "red" or "christianity." For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according…
Taxonomy
Topics: linguistics and terminology studies
