ParaNames: A Massively Multilingual Entity Name Corpus

Jonne Sälevä; Constantine Lignos

arXiv:2202.14035·cs.CL·July 13, 2022

TL;DR

ParaNames is the largest multilingual entity name corpus with 118 million names across 400 languages, aiding multilingual NLP tasks like name translation, transliteration, and entity recognition.

Contribution

The paper introduces ParaNames, the largest parallel name resource of its kind to date. It covers 400 languages and maps every entity to a standardized type (PER/LOC/ORG), which supports multilingual named entity processing.

Findings

01. Created the largest multilingual name corpus to date.

02. Demonstrated application in multilingual name translation.

03. Resource is publicly available under CC BY 4.0.

Abstract

We introduce ParaNames, a multilingual parallel name resource consisting of 118 million names spanning 400 languages. Names are provided for 13.6 million entities, which are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English. Our resource is released under a Creative Commons license (CC BY 4.0) at https://github.com/bltlab/paranames.
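
To make the shape of such a resource concrete, here is a minimal sketch of how parallel names keyed by entity could be grouped and turned into English-to-target name pairs for a translation task. The column names (`entity_id`, `language`, `name`, `type`), the tab-separated layout, and the helper functions are assumptions made for illustration; they are not taken from the released files, whose actual schema should be checked in the repository. The sample rows are hand-written examples, not data drawn from ParaNames.

```python
import csv
import io
from collections import defaultdict

# Hand-written illustrative rows. The column names and TSV layout are
# assumptions for this sketch, not the documented schema of the release.
SAMPLE_TSV = (
    "entity_id\tlanguage\tname\ttype\n"
    "Q90\ten\tParis\tLOC\n"
    "Q90\tfi\tPariisi\tLOC\n"
    "Q90\tru\tПариж\tLOC\n"
    "Q937\ten\tAlbert Einstein\tPER\n"
    "Q937\tru\tАльберт Эйнштейн\tPER\n"
)


def load_names(tsv_text):
    """Group rows by entity: {entity_id: {"type": str, "names": {lang: name}}}."""
    entities = defaultdict(lambda: {"type": None, "names": {}})
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        entry = entities[row["entity_id"]]
        entry["type"] = row["type"]
        entry["names"][row["language"]] = row["name"]
    return dict(entities)


def english_pairs(entities, target_lang):
    """Return (English name, target name, type) for entities that have both."""
    pairs = []
    for entry in entities.values():
        names = entry["names"]
        if "en" in names and target_lang in names:
            pairs.append((names["en"], names[target_lang], entry["type"]))
    return pairs


entities = load_names(SAMPLE_TSV)
ru_pairs = english_pairs(entities, "ru")
fi_pairs = english_pairs(entities, "fi")
```

Keying on the entity identifier rather than on the name string keeps names for the same entity aligned across languages, and carrying the PER/LOC/ORG type along with each pair allows a downstream task to filter or condition on entity type.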

Code & Models

Repositories

bltlab/paranames (official)

Datasets

imvladikon/paranames
dataset · 82 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management