Similarity-Based Models of Word Cooccurrence Probabilities
Ido Dagan, Lillian Lee, Fernando C. N. Pereira

TL;DR
This paper introduces a similarity-based method for estimating the probabilities of unseen word combinations. By drawing on distributionally similar words, it improves language modeling and pseudo-word disambiguation accuracy.
Contribution
The paper presents probabilistic word association models that use distributional word similarity to estimate the probabilities of unseen word pairs, outperforming purely frequency-based estimates.
Findings
20% perplexity reduction on unseen bigrams in language modeling
Statistically significant reduction in speech recognition error
Up to 40% improvement on a pseudo-word disambiguation task
Abstract
In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is…
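The core idea in the abstract, estimating the probability of an unseen word pair from the words most similar to its first word, can be illustrated with a minimal sketch. This is not the authors' exact formulation: the toy corpus, the function names, the parameters `beta` and `k`, and the choice of an exponentiated negative Jensen-Shannon divergence as the similarity weight are all assumptions made for illustration, and the sketch omits any combination with a back-off model.

```python
import math
from collections import Counter, defaultdict

# Toy bigram corpus (hypothetical data, for illustration only).
bigrams = [
    ("eat", "peach"), ("eat", "apple"), ("eat", "apple"),
    ("devour", "peach"), ("devour", "apple"), ("devour", "pear"),
    ("visit", "beach"), ("visit", "city"),
]

counts = defaultdict(Counter)
for w1, w2 in bigrams:
    counts[w1][w2] += 1


def mle(w1):
    """Maximum-likelihood conditional distribution P(w2 | w1)."""
    total = sum(counts[w1].values())
    return {w2: c / total for w2, c in counts[w1].items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two sparse distributions."""
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        if pw > 0:
            div += 0.5 * pw * math.log(pw / m)
        if qw > 0:
            div += 0.5 * qw * math.log(qw / m)
    return div


def similarity_estimate(w1, w2, beta=5.0, k=2):
    """Similarity-weighted average of P(w2 | v) over the k words v most similar to w1.

    beta and k are illustrative parameters, not values taken from the paper.
    """
    p1 = mle(w1)
    scored = [
        (math.exp(-beta * js_divergence(p1, mle(v))), v)
        for v in counts
        if v != w1
    ]
    neighbours = sorted(scored, reverse=True)[:k]
    norm = sum(weight for weight, _ in neighbours)
    return sum(weight * mle(v).get(w2, 0.0) for weight, v in neighbours) / norm


# "eat pear" never occurs in the toy corpus, so its frequency-based estimate is 0,
# but "devour" is distributionally close to "eat" and has been seen with "pear".
print(similarity_estimate("eat", "pear"))
print(similarity_estimate("eat", "beach"))
```

In this toy setting the unseen pair "eat pear" receives a noticeably higher estimate than "eat beach", because "devour" shares most of its contexts with "eat" while "visit" shares none. Because the result is a weighted average of conditional distributions, the estimates for a fixed first word still sum to one over the second word.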
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
