Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation
Tommaso Pasini, Francesco Maria Elia, Roberto Navigli

TL;DR
This paper introduces six large-scale multilingual sense-annotated datasets for Word Sense Disambiguation, enabling supervised learning across languages and improving performance especially for low-resource languages.
Contribution
It provides six large-scale multilingual sense-annotated datasets covering all the nouns in the English WordNet and their translations in other languages, facilitating supervised WSD research.
Findings
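Supervised WSD systems trained on these datasets surpass the state of the art for low-resource languages.
The same systems achieve competitive results for English, where manually annotated training sets are available.
The datasets enable supervised WSD in multiple languages.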
Abstract
We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages for a total of millions of sense-tagged sentences. Experiments prove that these corpora can be effectively used as training sets for supervised WSD systems, surpassing the state of the art for low-resourced languages and providing competitive results for English, where manually annotated training sets are accessible. The data is available at trainomatic.org.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
