Cross Script Hindi English NER Corpus from Wikipedia
Mohd Zeeshan Ansari, Tanvir Ahmad, Md Arshad Ali

TL;DR
This paper introduces a cross-script Hindi-English corpus built from Wikipedia category pages and annotated for NER with CoNLL-2003 categories, intended to support research on mixed-lingual Indian language processing. The authors report favorable results when evaluating it with several machine learning algorithms.
Contribution
It presents the first annotated cross-script Hindi-English NER corpus from Wikipedia, addressing the lack of standard datasets for mixed-lingual Indian NER research.
Findings
Successful corpus annotation using CoNLL-2003 categories
Effective evaluation across multiple machine learning algorithms
Favorable results demonstrating corpus utility
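To make the annotation scheme concrete, here is a minimal Python sketch of reading entity spans from CoNLL-2003-style token tags. The summary states only that the corpus uses CoNLL-2003 categories; the BIO tag prefixes, the function name `extract_entities`, and the sample sentence are illustrative assumptions, not material from the corpus.

```python
from typing import List, Optional, Tuple

# The four entity categories defined by the CoNLL-2003 shared task.
CONLL_TYPES = {"PER", "LOC", "ORG", "MISC"}


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) pairs.

    Assumes tags look like "B-PER", "I-PER", or "O" (an assumed encoding;
    the corpus may use a different one).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, etype = "O", None
        else:
            prefix, _, etype = tag.partition("-")
            if prefix not in ("B", "I") or etype not in CONLL_TYPES:
                raise ValueError(f"unexpected tag: {tag}")

        # A "B-" tag, or an "I-" tag whose type differs from the open span,
        # starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and etype != current_type)

        # Close the open span when we leave it or start another one.
        if (prefix == "O" or starts_new) and current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_type = etype

    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical Romanized Hindi-English sentence, invented for illustration.
sample_tokens = ["Sachin", "Tendulkar", "ka", "janm", "Mumbai", "mein", "hua", "tha"]
sample_tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O"]

if __name__ == "__main__":
    print(extract_entities(sample_tokens, sample_tags))
```

Grouping tags into spans like this is a common first step before computing entity-level precision and recall, which is how CoNLL-2003-style NER output is usually scored.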
Abstract
The text generated on social media platforms is essentially mixed-lingual. Mixing languages in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend on the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems is facing obstacles due to the unavailability of standard evaluation corpora. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a cross-script Hindi-English corpus from Wikipedia category pages. The corpus is successfully annotated using standard…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
