ANEC: An Amharic Named Entity Corpus and Transformer Based Recognizer

Ebrahim Chekol Jibril; A. Cüneyd Tantuğ

arXiv:2207.00785·cs.CL·July 5, 2022·1 cites

Open Access

TL;DR

This paper introduces a new Amharic named entity recognition dataset and a transformer-based system that achieves state-of-the-art accuracy, addressing challenges specific to Semitic languages.

Contribution

It provides a new annotated Amharic NER dataset (8,070 sentences, 182,691 tokens) and combines the Synthetic Minority Over-sampling Technique (SMOTE) with a bidirectional LSTM-CRF recognizer to improve recognition performance.

Findings

01 Achieved 93% F1 score on Amharic NER

02 Created a new annotated dataset with 8,070 sentences

03 Applied SMOTE to handle class imbalance
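
To make the SMOTE finding concrete, here is a minimal sketch of the core idea: a synthetic minority sample is created by interpolating between an existing minority sample and one of its nearest minority neighbours. The function name smote_oversample, the toy 2-D vectors, and the parameters k and seed are assumptions for illustration only; this page does not say how the authors represent tokens or configure SMOTE.

```python
import random

def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (toy sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical feature vectors for a rare entity class (not from the paper).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=6)
```

Each synthetic point lies on the line segment between two real minority samples, so the oversampled class stays inside the region the original samples already occupy.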

Abstract

Named entity recognition is an information extraction task that serves as a preprocessing step for other natural language processing tasks, such as machine translation, information retrieval, and question answering. It identifies proper names as well as temporal and numeric expressions in open-domain text. For Semitic languages such as Arabic, Amharic, and Hebrew, the task is more challenging because these languages are heavily inflected. In this paper, we present an Amharic named entity recognition system based on bidirectional long short-term memory with a conditional random fields layer. We annotate a new Amharic named entity recognition dataset (8,070 sentences comprising 182,691 tokens) and apply the Synthetic Minority Over-sampling Technique to our dataset to mitigate the imbalanced classification…
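
As a rough illustration of what a conditional random fields layer adds on top of a bidirectional LSTM, the sketch below runs Viterbi decoding over per-token tag scores plus tag-to-tag transition scores. Everything here is assumed for illustration: the tag names (O, B-PER, I-PER), the emission numbers standing in for BiLSTM outputs, and the single transition penalty. It is not the authors' implementation.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: one dict per token mapping tag -> score (stand-in for BiLSTM output).
    transitions: dict mapping (previous_tag, current_tag) -> score; missing pairs count as 0.
    """
    # best[t][tag] is the best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

tags = ["O", "B-PER", "I-PER"]                 # assumed tag set
transitions = {("O", "I-PER"): -10.0}          # discourage I-PER directly after O
emissions = [                                  # made-up scores for three tokens
    {"O": 0.1, "B-PER": 2.0, "I-PER": 0.5},
    {"O": 0.2, "B-PER": 0.3, "I-PER": 1.5},
    {"O": 2.0, "B-PER": 0.1, "I-PER": 0.4},
]
decoded = viterbi(emissions, transitions, tags)
```

The transition scores let the decoder favour sequences of tags that are consistent with each other, rather than choosing each token's tag independently.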

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies