Cross Script Hindi English NER Corpus from Wikipedia

Mohd Zeeshan Ansari; Tanvir Ahmad; Md Arshad Ali

arXiv:1810.03430 · cs.IR · October 9, 2018 · 1 cites

Open Access

TL;DR

This paper introduces a new cross-script Hindi-English corpus from Wikipedia, annotated for NER, to facilitate research in mixed-lingual Indian language processing, showing promising results across machine learning models.

Contribution

It presents the first annotated cross-script Hindi-English NER corpus from Wikipedia, addressing the lack of standard datasets for mixed-lingual Indian NER research.

Findings

01. Successful corpus annotation using CoNLL-2003 categories
02. Effective evaluation across multiple machine learning algorithms
03. Favorable results demonstrating corpus utility
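To make the CoNLL-2003 annotation mentioned in the findings more concrete, here is a minimal sketch of how a cross-script sentence could be tagged and read back. The four entity types (PER, LOC, ORG, MISC) are the CoNLL-2003 categories; everything else is an assumption for illustration: the sample sentence is hypothetical and not taken from the corpus, the Roman-script rendering of Hindi is assumed, and the BIO tagging scheme and the helper names `parse_conll` and `extract_entities` are not confirmed by the paper.

```python
from typing import List, Tuple

# The four entity categories defined by the CoNLL-2003 shared task.
CONLL_TYPES = {"PER", "LOC", "ORG", "MISC"}

# Hypothetical sample: Hindi written in Roman script, one token per line,
# with an assumed BIO tag. Illustrative only, not drawn from the corpus.
SAMPLE = """\
Sachin B-PER
Tendulkar I-PER
ka O
janm O
Mumbai B-LOC
mein O
hua O
"""


def parse_conll(text: str) -> List[Tuple[str, str]]:
    """Split 'token tag' lines into (token, tag) pairs, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        token, tag = line.rsplit(None, 1)
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) spans."""
    entities = []
    tokens: List[str] = []
    etype = None
    for token, tag in pairs:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            # Close any open span, then open a new one.
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the current span.
            tokens.append(token)
        else:
            # An 'O' tag closes any open span.
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities


entities = extract_entities(parse_conll(SAMPLE))
```

For this sample, `entities` holds one person span and one location span. A real loader would also need to handle whatever column layout and sentence separators the released corpus files actually use.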

Abstract

Text generated on social media platforms is essentially mixed-lingual. Language mixing in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend on the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems is hindered by the unavailability of standard evaluation corpora. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a Cross Script Hindi-English Corpus from Wikipedia category pages. The corpus is successfully annotated using standard…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification