A Short Survey on Sense-Annotated Corpora

Tommaso Pasini; Jose Camacho-Collados

arXiv:1802.04744·cs.CL·March 16, 2020

TL;DR

This survey reviews available sense-annotated corpora across multiple languages, highlighting manual and automatic annotation methods, dataset features, and their role in advancing Word Sense Disambiguation systems.

Contribution

The survey provides a comprehensive overview of existing sense-annotated datasets, comparing their features, annotation methods, and the lexical resources they use as sense inventories.

Findings

1. Most datasets are manually or semi-automatically annotated.

2. Datasets vary in size, language, and lexical resources.

3. Automatic methods are increasingly used to expand datasets.

Abstract

Large sense-annotated datasets are increasingly necessary for training deep supervised systems in Word Sense Disambiguation. However, gathering high-quality sense-annotated data for as many instances as possible is a laborious and expensive task. This has led to the proliferation of automatic and semi-automatic methods for overcoming the so-called knowledge-acquisition bottleneck. In this short survey we present an overview of sense-annotated corpora, annotated either manually or (semi-)automatically, that are currently available for different languages and that use distinct lexical resources as sense inventories, namely WordNet, Wikipedia, and BabelNet. Furthermore, we provide the reader with general statistics for each dataset and an analysis of its specific features.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification