Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Diego Alves; Gaurish Thakkar; Gabriel Amaral; Tin Kuculo; Marko Tadić

arXiv:2212.07429·cs.CL·December 16, 2022

Open Access

TL;DR

This paper introduces the UNER dataset, a multilingual, hierarchical named-entity corpus created using Wikipedia and DBpedia, enabling improved NER in low-resource languages through a detailed, scalable extraction and annotation process.

Contribution

The paper presents a novel, scalable method for constructing multilingual, hierarchical NER datasets using Wikipedia and DBpedia, applicable to any language available on Wikipedia.

Findings

01 Created the UNER multilingual NER dataset

02 Developed a three-step extraction and linking procedure

03 Significantly increased entity detection through post-processing

Abstract

With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
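The three steps named in the abstract (extract, link, map) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the in-memory table `TOY_DBPEDIA_CLASSES`, the rule list `TOY_LABEL_RULES`, the label strings, and all function names are hypothetical stand-ins chosen for illustration, and the sketch assumes entities are marked as `[[wikilinks]]` in the article text. The post-processing step is not covered.

```python
import re

# Hypothetical stand-in for DBpedia lookups: page title -> set of DBpedia classes.
TOY_DBPEDIA_CLASSES = {
    "Zagreb": {"dbo:Place", "dbo:City"},
    "Marie_Curie": {"dbo:Person", "dbo:Scientist"},
}

# Hypothetical mapping rules, most specific class first.
# The labels are placeholders, not the actual UNER tag set.
TOY_LABEL_RULES = [
    ("dbo:City", "Location/City"),
    ("dbo:Place", "Location"),
    ("dbo:Scientist", "Person/Scientist"),
    ("dbo:Person", "Person"),
]

# Matches [[Target]] and [[Target|surface text]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_links(wikitext):
    """Step 1: collect (surface_text, page_title) pairs from wikilinks."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        pairs.append((surface, target.replace(" ", "_")))
    return pairs


def link_to_classes(title, kb=TOY_DBPEDIA_CLASSES):
    """Step 2: look up the set of classes for a page title (empty if unknown)."""
    return kb.get(title, set())


def map_to_label(classes, rules=TOY_LABEL_RULES):
    """Step 3: return the label of the first rule whose class is present."""
    for dbpedia_class, label in rules:
        if dbpedia_class in classes:
            return label
    return None


def annotate(wikitext):
    """Run the three steps and keep only mentions that received a label."""
    results = []
    for surface, title in extract_links(wikitext):
        label = map_to_label(link_to_classes(title))
        if label is not None:
            results.append((surface, label))
    return results
```

Calling `annotate("[[Marie Curie]] visited [[Zagreb|the Croatian capital]].")` on this toy data labels both mentions, while a link to a page missing from the table is silently dropped. Ordering the rules from most to least specific is one simple way to respect a label hierarchy; a real pipeline would query DBpedia rather than a hard-coded dictionary.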

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques