Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set
Hassan S. Shavarani, Satoshi Sekine

TL;DR
This paper introduces a large multilingual Wikipedia article dataset with fine-grained entity tags and evaluates classification models, revealing challenges in handling extensive, detailed label sets across multiple languages.
Contribution
SHINRA-5LDS, a dataset of multilingual, multi-labeled Wikipedia articles annotated with an extended named entity tag set, is a novel resource for NLP research.
Findings
Current models struggle with large, fine-grained label sets.
Multilingual dataset enables cross-lingual classification studies.
Evaluation highlights the need for improved models for detailed categorization.
Abstract
Wikipedia is a great source of general world knowledge that can help NLP models better understand the motivation behind their predictions. Structuring Wikipedia is the initial step towards this goal and can facilitate fine-grained classification of articles. In this work, we introduce the Shinra 5-Language Categorization Dataset (SHINRA-5LDS), a large multilingual and multi-labeled set of annotated Wikipedia articles in Japanese, English, French, German, and Farsi using the Extended Named Entity (ENE) tag set. We evaluate the dataset using the best models provided for ENE label set classification and show that the currently available classification models struggle with large datasets using fine-grained tag sets.
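To make the multi-labeled setting concrete, the following is a minimal sketch of how predictions over a fine-grained tag set might be scored when each article can carry several labels. The metric (micro-averaged F1), the function name `micro_f1`, and the label names are illustrative assumptions; the abstract does not state which metric or labels the authors use.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions.

    Assumption: this metric is chosen for illustration only; it is not
    confirmed to be the one reported in the paper.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in the gold set
        fn += len(g - p)   # gold labels the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for three articles (not taken from the dataset).
gold = [{"Person"}, {"City", "Province"}, {"Company"}]
pred = [{"Person"}, {"City"}, {"School"}]
score = micro_f1(gold, pred)
```

With a large, fine-grained tag set, partially correct predictions such as the second article above are common, which is one reason a set-based metric can be more informative than exact-match accuracy.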
Taxonomy
Topics: Wikis in Education and Collaboration · Topic Modeling · Natural Language Processing Techniques
