CL-NERIL: A Cross-Lingual Model for NER in Indian Languages
Akshara Prabhakar, Gouri Sankar Majumder, Ashish Anand

TL;DR
This paper introduces CL-NERIL, a cross-lingual NER framework for Indian languages that leverages parallel corpora and a teacher-student model to improve performance in low-resource settings.
Contribution
It proposes a novel annotation projection method combined with a teacher-student model to enhance NER in Indian languages using weakly labeled data.
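To make the teacher-student part concrete, here is a minimal sketch of a joint objective for a single token. The abstract only says the model is optimized jointly on the Teacher's pseudo labels and on the weakly labeled data. The cross-entropy form, the `alpha` weighting, and the function names below are assumptions for illustration, not the authors' formulation.

```python
import math
from typing import List


def cross_entropy(target: List[float], predicted: List[float]) -> float:
    """Cross-entropy between a target distribution and a predicted one."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def joint_loss(
    student_probs: List[float],
    teacher_probs: List[float],
    weak_label: int,
    alpha: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> float:
    """Combine a pseudo-label term and a weak-label term for one token."""
    # Term 1: match the teacher's soft pseudo labels.
    distill = cross_entropy(teacher_probs, student_probs)
    # Term 2: fit the projected (weak) hard label.
    one_hot = [1.0 if i == weak_label else 0.0 for i in range(len(student_probs))]
    weak = cross_entropy(one_hot, student_probs)
    return alpha * distill + (1.0 - alpha) * weak


# A student that agrees with both signals versus one that disagrees.
teacher = [0.9, 0.05, 0.05]
loss_agree = joint_loss([0.9, 0.05, 0.05], teacher, weak_label=0)
loss_disagree = joint_loss([0.1, 0.8, 0.1], teacher, weak_label=0)
```

In this sketch, a student that disagrees with both the teacher and the weak label gets a larger loss than one that agrees with them.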
Findings
At least a 10% performance improvement over zero-shot models
Effective use of weakly labeled data to supplement source language data
Framework applicable to multiple Indian languages
Abstract
Developing Named Entity Recognition (NER) systems for Indian languages has been a long-standing challenge, mainly owing to the requirement of a large amount of annotated clean training instances. This paper proposes an end-to-end framework for NER for Indian languages in a low-resource setting by exploiting parallel corpora of English and Indian languages and an English NER dataset. The proposed framework includes an annotation projection method that combines word alignment score and NER tag prediction confidence score on source language (English) data to generate weakly labeled data in a target Indian language. We employ a variant of the Teacher-Student model and optimize it jointly on the pseudo labels of the Teacher model and predictions on the generated weakly labeled data. We also present manually annotated test sets for three Indian languages: Hindi, Bengali, and Gujarati. We…
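The annotation projection step described above can be sketched as follows. The abstract states that a word alignment score and an NER tag confidence score are combined, but not how. The product rule, the `threshold` parameter, and the function name here are assumptions for illustration, and the example scores are made up.

```python
from typing import List, Tuple


def project_annotations(
    src_tags: List[str],
    tag_conf: List[float],
    alignments: List[Tuple[int, int, float]],
    tgt_len: int,
    threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> List[str]:
    """Project source (English) NER tags onto target-language tokens.

    `alignments` holds (src_index, tgt_index, alignment_score) triples.
    """
    tgt_tags = ["O"] * tgt_len
    best = [0.0] * tgt_len  # best combined score seen per target token
    for s, t, align_score in alignments:
        if src_tags[s] == "O":
            continue  # only entity tags are projected
        # Assumed combination rule: product of the two scores.
        score = align_score * tag_conf[s]
        if score >= threshold and score > best[t]:
            tgt_tags[t] = src_tags[s]
            best[t] = score
    return tgt_tags


# English: "Sachin lives in Mumbai" -> Hindi: "सचिन मुंबई में रहता है"
src_tags = ["B-PER", "O", "O", "B-LOC"]
tag_conf = [0.98, 0.99, 0.99, 0.60]
alignments = [(0, 0, 0.95), (1, 3, 0.90), (2, 2, 0.90), (3, 1, 0.70)]

strict = project_annotations(src_tags, tag_conf, alignments, tgt_len=5)
lenient = project_annotations(src_tags, tag_conf, alignments, tgt_len=5, threshold=0.4)
```

With the default threshold, only the confidently tagged and well-aligned "Sachin" is projected. Lowering the threshold also admits the less certain "Mumbai" label. This shows how such a cutoff would trade coverage against noise in the weakly labeled data.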
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
