Similarity-Based Models of Word Cooccurrence Probabilities

Ido Dagan; Lillian Lee; Fernando C. N. Pereira

arXiv:cs/9809110 · cs.CL · May 23, 2007 · 70 cites

TL;DR

This paper introduces a similarity-based method for estimating the probabilities of unseen word combinations. Evidence from distributionally similar words stands in for the missing counts, which improves both language modeling and pseudo-word disambiguation.

Contribution

It presents probabilistic word association models that use distributional word similarity to estimate the probabilities of unseen word pairs, and shows that they outperform back-off and maximum-likelihood estimation.

Findings

01. 20% perplexity improvement in the prediction of unseen bigrams (language modeling task)

02. Statistically significant reduction in speech-recognition error

03. Up to 40% better performance on the pseudo-word disambiguation task

Abstract

In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is…
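The idea of borrowing evidence from "most similar" words can be illustrated with a minimal sketch. It estimates P(w2 | w1) for an unseen pair as a weighted average of P(w2 | w1') over the words w1' whose observed distributions are closest to that of w1. The toy corpus, the function names, the neighbourhood size `k`, the sharpness parameter `beta`, and the choice of Jensen-Shannon divergence with an exponential weight are all assumptions made for illustration, not a reproduction of the authors' exact models.

```python
import math
from collections import Counter, defaultdict

def conditional_probs(pairs):
    """Maximum-likelihood P(w2 | w1) from a list of (w1, w2) pairs."""
    counts = defaultdict(Counter)
    for w1, w2 in pairs:
        counts[w1][w2] += 1
    return {
        w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
        for w1, ctr in counts.items()
    }

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) of two sparse distributions."""
    total = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        if pw > 0:
            total += 0.5 * pw * math.log(pw / m)
        if qw > 0:
            total += 0.5 * qw * math.log(qw / m)
    return total

def similarity_estimate(w1, w2, probs, k=2, beta=5.0):
    """Weighted average of P(w2 | w1') over the k words most similar to w1."""
    # Rank every other conditioning word by divergence from w1 (smaller = closer).
    neighbours = sorted(
        (js_divergence(probs[w1], probs[other]), other)
        for other in probs if other != w1
    )[:k]
    # Illustrative weighting: closer neighbours get exponentially more weight.
    weights = [(math.exp(-beta * d), other) for d, other in neighbours]
    norm = sum(wt for wt, _ in weights)
    return sum(wt * probs[other].get(w2, 0.0) for wt, other in weights) / norm

# Hypothetical toy corpus of (verb, object) pairs; "munch peach" never occurs.
pairs = [
    ("eat", "apple"), ("eat", "pear"), ("eat", "peach"),
    ("munch", "apple"), ("munch", "pear"),
    ("visit", "beach"), ("visit", "museum"),
]
probs = conditional_probs(pairs)

mle_peach = probs["munch"].get("peach", 0.0)           # zero: pair is unseen
sim_peach = similarity_estimate("munch", "peach", probs)
sim_beach = similarity_estimate("munch", "beach", probs)
print(mle_peach, sim_peach, sim_beach)
```

In this toy setting the maximum-likelihood estimate for the unseen pair is zero, while the similarity-based estimate is positive and ranks "peach" above "beach" after "munch", because "munch" shares its observed objects with "eat" rather than with "visit". A full model would typically combine such an estimate with a smoothing scheme for seen and unseen events; that part is omitted here.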

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis