AsNER -- Annotated Dataset and Baseline for Assamese Named Entity recognition
Dhrubajyoti Pathak, Sukumar Nandi, Priyankoo Sarmah

TL;DR
This paper introduces AsNER, an annotated dataset for Assamese named entity recognition (NER), together with baseline models. The best baseline reaches an F1-score of 80.69%, giving NLP research on the low-resource Assamese language a benchmark to build on.
Contribution
The paper provides the first large-scale Assamese NER dataset and benchmarks multiple state-of-the-art models, establishing a foundation for future research in low-resource language processing.
Findings
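The MuRIL-based model achieved the highest F1-score among the baselines, 80.69%.
The dataset contains about 99k tokens, drawn from the speech of the Prime Minister of India and an Assamese play, plus person names, location names and addresses.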
Baseline models demonstrate the dataset's utility for Assamese NER.
Abstract
We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens of text drawn from the speech of the Prime Minister of India and an Assamese play. It also contains person names, location names and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network based Assamese language processing. We benchmark the dataset by training and evaluating NER models with state-of-the-art architectures for supervised named entity recognition (NER), such as FastText, BERT, XLM-R, FLAIR and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top performing model are…
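The abstract reports an F1-score but this summary does not say how it is computed. The following is a minimal sketch of one common convention, entity-level (exact span match) F1 over BIO tags. The BIO scheme, the `PER`/`LOC` labels, and the function names `extract_spans` and `entity_f1` are assumptions made for illustration; the paper's own evaluation may differ.

```python
from typing import List, Optional, Set, Tuple

# (start, end_exclusive, entity_type); hypothetical representation
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed scheme)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    etype: Optional[str] = None
    # A trailing "O" sentinel closes any span still open at sentence end.
    for i, tag in enumerate(tags + ["O"]):
        begins = tag.startswith("B-")
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # A stray "I-" with no open span is treated leniently as a new span.
        if begins or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a prediction counts only on an exact span and type match."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(round(entity_f1(gold, pred), 4))
```

In the toy example the prediction finds one of the two gold entities with no false positives, so precision is 1.0, recall is 0.5, and F1 is 2/3. Token-level accuracy on the same sentence would be 3/4, which is why the two metrics should not be used interchangeably.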
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Attention Is All You Need · XLM-R · Linear Layer · Dropout · Linear Warmup With Linear Decay · Adam · Layer Normalization · Weight Decay · WordPiece · Softmax
