How big is big enough? Unsupervised word sense disambiguation using a very large corpus

Piotr Przybyła

arXiv:1710.07960·cs.CL·October 24, 2017·2 cites

TL;DR

This paper explores unsupervised word sense disambiguation for Polish using a massive corpus, leveraging related words and heuristics to improve accuracy with a modified Bayesian classifier.

Contribution

It introduces new heuristics based on WordNet relations and evaluates the impact of training set size on disambiguation performance using an unprecedentedly large corpus.

Findings

01

Disambiguation accuracy improves with larger training data.

02

Rich sources of replacements enhance disambiguation performance.

03

Modified Bayesian classifier effectively handles sense distribution uncertainty.

Abstract

In this paper, the problem of disambiguating a target word for Polish is approached by searching for related words with known meaning. These relatives are used to build a training corpus from unannotated text. This technique is improved by proposing new rich sources of replacements that substitute the traditional requirement of monosemy with heuristics based on wordnet relations. The naïve Bayesian classifier has been modified to account for an unknown distribution of senses. A corpus of 600 million web documents (594 billion tokens), gathered by the NEKST search engine, allows us to assess the relationship between training set size and disambiguation accuracy. The classifier is evaluated using both a wordnet baseline and a corpus with 17,314 manually annotated occurrences of 54 ambiguous words.
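
The relatives-based approach described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy English data, and the add-one smoothing are all assumptions made for illustration. In particular, dropping the prior term (a uniform prior over senses) is only one simple way to read "an unknown distribution of senses"; the abstract does not specify the actual modification.

```python
import math
from collections import Counter

def train(relative_contexts):
    """Count context words per sense.

    relative_contexts maps a sense label to a list of tokenized contexts
    in which a relative of that sense (a word of known meaning) occurred.
    """
    counts = {
        sense: Counter(tok for ctx in contexts for tok in ctx)
        for sense, contexts in relative_contexts.items()
    }
    # Shared vocabulary over all senses, used for smoothing.
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(context, counts, vocab):
    """Pick the sense with the highest smoothed log-likelihood.

    Assumption: no log P(sense) term is added, i.e. the prior is uniform,
    because the true sense distribution is treated as unknown.
    """
    v = len(vocab)
    scores = {}
    for sense, c in counts.items():
        total = sum(c.values())
        scores[sense] = sum(
            math.log((c[tok] + 1) / (total + v))  # add-one smoothing
            for tok in context
            if tok in vocab  # ignore words never seen in training
        )
    return max(scores, key=scores.get)

# Invented toy data: contexts of relatives for two senses of "bank".
relative_contexts = {
    "finance": [["lender", "approved", "loan"], ["deposit", "money", "account"]],
    "river": [["muddy", "shore", "water"], ["fishing", "river", "shore"]],
}
counts, vocab = train(relative_contexts)
predicted = classify(["loan", "money"], counts, vocab)
```

Because the training contexts come from relatives rather than from annotated occurrences of the target word, the per-sense counts reflect how often the relatives appear, not how often each sense does. That is the motivation for leaving the prior out in this sketch.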

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems