Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation

Tommaso Pasini; Francesco Maria Elia; Roberto Navigli

arXiv:1805.04685 · cs.CL · May 15, 2018 · 6 cites

TL;DR

This paper introduces six large-scale multilingual sense-annotated datasets for Word Sense Disambiguation, enabling supervised learning across languages and improving performance especially for low-resource languages.

Contribution

It provides the first extensive multilingual sense-annotated datasets covering all nouns in the English WordNet and their translations, facilitating supervised WSD research.

Findings

01

Systems trained on these datasets outperform previous WSD systems for low-resource languages.

02

Achieves competitive results for English WSD.

03

Enables supervised learning for multiple languages.

Abstract

We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages, for a total of millions of sense-tagged sentences. Experiments show that these corpora can be effectively used as training sets for supervised WSD systems, surpassing the state of the art for low-resourced languages and providing competitive results for English, where manually annotated training sets are accessible. The data is available at trainomatic.org.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification