Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set

Hassan S. Shavarani; Satoshi Sekine

arXiv:1909.06502 · cs.CL · March 9, 2020 · 1 citation

TL;DR

This paper introduces a large multilingual Wikipedia article dataset with fine-grained entity tags and evaluates classification models, revealing challenges in handling extensive, detailed label sets across multiple languages.

Contribution

The SHINRA-5LDS dataset, a multilingual, multi-labeled collection of Wikipedia articles annotated with the Extended Named Entity (ENE) tag set, is a new resource for NLP research.

Findings

01. Current models struggle with large, fine-grained label sets.

02. Multilingual dataset enables cross-lingual classification studies.

03. Evaluation highlights the need for improved models for detailed categorization.

Abstract

Wikipedia is a great source of general world knowledge which can guide NLP models to better understand their motivation to make predictions. Structuring Wikipedia is the initial step towards this goal, and it can facilitate fine-grained classification of articles. In this work, we introduce the Shinra 5-Language Categorization Dataset (SHINRA-5LDS), a large multilingual and multi-labeled set of annotated Wikipedia articles in Japanese, English, French, German, and Farsi using the Extended Named Entity (ENE) tag set. We evaluate the dataset using the best models provided for ENE label set classification and show that the currently available classification models struggle with large datasets using fine-grained tag sets.
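Because each article in a multi-labeled dataset can carry more than one tag, predictions are naturally compared as sets of labels per article. The following is a minimal sketch of one common way to score such predictions, micro-averaged F1. It is an illustration only: the abstract does not state which metric the authors report, the function name `micro_f1` is hypothetical, and the label strings are made-up placeholders rather than entries taken from the dataset.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-article label sets.

    Assumption: this metric is chosen for illustration; it is not
    confirmed to be the one used in the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder labels for three articles (not real dataset entries).
gold = [{"Person"}, {"City", "Port"}, {"Company"}]
pred = [{"Person"}, {"City"}, {"School"}]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.4f}")
```

Micro-averaging pools counts over all articles and labels, so frequent tags dominate the score. With a large fine-grained tag set, a macro-averaged variant (averaging per-label F1) would weight rare tags more heavily.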

Code & Models

Datasets

shavarani/SHINRA-5LDS
dataset · 10 downloads

Taxonomy

Topics: Wikis in Education and Collaboration · Topic Modeling · Natural Language Processing Techniques