Word Clustering and Disambiguation Based on Co-occurrence Data
Hang Li, Naoki Abe (NEC Corporation)

TL;DR
This paper presents an efficient word clustering and disambiguation method based on co-occurrence data and the MDL principle, achieving higher accuracy than previous approaches by combining automatic and manual thesauruses.
Contribution
It introduces a novel MDL-based algorithm for word clustering and integrates it with disambiguation techniques, improving accuracy over prior methods.
Findings
Disambiguation accuracy of 85.2% with the proposed method.
Outperforms the previous state-of-the-art accuracy of 82.4%.
Effective combination of automatic and manual thesauruses.
Abstract
We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and of using the acquired word classes to improve the accuracy of syntactic disambiguation. We view this problem as that of estimating a joint probability distribution specifying the joint probabilities of word pairs, such as noun-verb pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability distribution. Our method is a natural extension of those proposed in (Brown et al 92) and (Li & Abe 96), and overcomes their drawbacks while retaining their advantages. We then combine this clustering method with the disambiguation method of (Li & Abe 95) to derive a disambiguation method that makes use of both automatically constructed thesauruses and a hand-made thesaurus. The overall disambiguation accuracy achieved by our…
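To make the MDL idea in the abstract concrete, here is a minimal sketch of scoring one candidate clustering of noun-verb pairs by its two-part code length. It is an illustration under stated assumptions, not the authors' algorithm: the summary above does not spell out the model, so the factorization P(n, v) = P(Cn, Cv) P(n | Cn) P(v | Cv), the (k/2) log2 N parameter cost, and the names `description_length`, `noun_class`, and `verb_class` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def description_length(pairs, noun_class, verb_class):
    """Two-part code length in bits of (noun, verb) pairs under a hard clustering.

    Assumed model (not taken from the summary above):
        P(n, v) = P(Cn, Cv) * P(n | Cn) * P(v | Cv)
    with all probabilities set to their maximum-likelihood estimates.
    """
    total = len(pairs)
    class_pair_counts = Counter((noun_class[n], verb_class[v]) for n, v in pairs)
    noun_counts = Counter(n for n, _ in pairs)
    verb_counts = Counter(v for _, v in pairs)
    noun_class_counts = Counter(noun_class[n] for n, _ in pairs)
    verb_class_counts = Counter(verb_class[v] for _, v in pairs)

    # Data term: negative log-likelihood of the observed pairs.
    data_bits = 0.0
    for n, v in pairs:
        cn, cv = noun_class[n], verb_class[v]
        p = (
            (class_pair_counts[(cn, cv)] / total)
            * (noun_counts[n] / noun_class_counts[cn])
            * (verb_counts[v] / verb_class_counts[cv])
        )
        data_bits -= math.log2(p)

    # Parameter term: (k / 2) * log2(N), a common MDL approximation.
    # k counts free parameters: the class-pair table plus within-class word
    # probabilities (each class loses one degree of freedom to normalization).
    k_n, k_v = len(noun_class_counts), len(verb_class_counts)
    free_params = (k_n * k_v - 1) + (len(noun_counts) - k_n) + (len(verb_counts) - k_v)
    param_bits = 0.5 * free_params * math.log2(total)

    return data_bits + param_bits


# Toy co-occurrence data: each (noun, verb) pair is observed twice.
pairs = [("wine", "drink"), ("beer", "drink"), ("bread", "eat"), ("rice", "eat")] * 2
verb_class = {"drink": "DRINK", "eat": "EAT"}

# Candidate 1: every noun in its own class.
singletons = {"wine": "c1", "beer": "c2", "bread": "c3", "rice": "c4"}
# Candidate 2: nouns that co-occur with the same verb share a class.
merged = {"wine": "BEVERAGE", "beer": "BEVERAGE", "bread": "FOOD", "rice": "FOOD"}

dl_singletons = description_length(pairs, singletons, verb_class)
dl_merged = description_length(pairs, merged, verb_class)
```

On this toy data both clusterings fit the pairs equally well (16 bits of data cost each), but the merged clustering needs fewer parameters, so its total is 23.5 bits against 26.5 bits for the singleton clustering. An MDL-based clustering procedure would prefer the merged classes for that reason.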
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
