ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Jonne Sälevä; Constantine Lignos

arXiv:2405.09496·cs.CL·May 16, 2024

TL;DR

ParaNames 1.0 is a multilingual entity name corpus derived from Wikidata, covering over 400 languages and 16.8 million entities. It is intended to support multilingual NLP tasks such as name translation and named entity recognition (NER).

Contribution

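This paper introduces ParaNames, a parallel name resource built from Wikidata that the authors describe as the largest of its type to date. Each entity is mapped from Wikidata's complex type hierarchy to a standard type (PER/LOC/ORG), and the resource's usefulness is demonstrated on name translation and named entity recognition.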

Findings

01

Canonical name translation demonstrated between English and 17 other languages.

02

Enhanced NER performance on 10 languages.

03

Largest resource of its kind to date.

Abstract

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance…
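As a rough illustration of the gazetteer use mentioned in the abstract, the sketch below builds a name-to-type lookup from a few rows and tags token spans by longest match. This is not the authors' NER setup: the row layout (entity ID, language code, name, type), the toy rows, and the helper names `build_gazetteer` and `tag_spans` are all assumptions made for the example, and a real system would more likely feed gazetteer matches to a trained tagger as features.

```python
from typing import Dict, List, Tuple

# Toy rows in a HYPOTHETICAL (entity_id, language, name, type) layout.
# The IDs and names are illustrative, not taken from the corpus.
ROWS: List[Tuple[str, str, str, str]] = [
    ("E1", "en", "Helsinki", "LOC"),
    ("E1", "sv", "Helsingfors", "LOC"),
    ("E2", "en", "United Nations", "ORG"),
    ("E3", "en", "Ada Lovelace", "PER"),
]

Gazetteer = Dict[Tuple[str, ...], str]


def build_gazetteer(rows: List[Tuple[str, str, str, str]], language: str) -> Gazetteer:
    """Map lowercased, whitespace-tokenized names in one language to PER/LOC/ORG."""
    gazetteer: Gazetteer = {}
    for _entity_id, lang, name, entity_type in rows:
        if lang == language:
            gazetteer[tuple(name.lower().split())] = entity_type
    return gazetteer


def tag_spans(tokens: List[str], gazetteer: Gazetteer) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans, preferring the longest match at each position."""
    max_len = max((len(key) for key in gazetteer), default=0)
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate window first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                spans.append((i, i + n, gazetteer[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


if __name__ == "__main__":
    gaz = build_gazetteer(ROWS, "en")
    sentence = "Ada Lovelace visited the United Nations in Helsinki".split()
    for start, end, entity_type in tag_spans(sentence, gaz):
        print(" ".join(sentence[start:end]), entity_type)
```

Keying the lookup on token tuples keeps multi-word names such as "United Nations" intact, and filtering by language code mirrors the idea of using one language's names at a time.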

Code & Models

Repositories

bltlab/paranames
Official

Datasets

bltlab/ParaNames
dataset · 76 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling