Exploring the Potential of Machine Translation for Generating Named Entity Datasets: A Case Study between Persian and English
Amir Sartipi, Afsaneh Fatemi

TL;DR
This paper investigates using machine translation to generate Persian named entity datasets from English data, demonstrating promising results and potential for improving low-resource language NER systems.
Contribution
It introduces an approach that uses machine translation to create Persian NER datasets and evaluates them with transformer models.
Findings
Highest F1 score of 85.11% on CoNLL 2003 dataset
Lower F1 score of 40.02% on WNUT 2017 dataset
Machine translation can augment datasets for low-resource languages
Abstract
This study generates Persian named entity datasets by applying machine translation to English datasets. The generated datasets were evaluated with one monolingual and one multilingual transformer model. The CoNLL 2003 dataset achieved the highest F1 score, 85.11%, while the WNUT 2017 dataset yielded the lowest, 40.02%. The results highlight the potential of machine translation for creating high-quality named entity recognition datasets for low-resource languages such as Persian. The study compares the performance of these generated datasets with English named entity recognition systems and provides insights into the effectiveness of machine translation for this task. Additionally, this approach could be used to augment data in low-resource languages or create noisy data to make named entity…
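The F1 scores reported above are the standard way to compare NER systems. As a rough illustration of how an entity-level F1 can be computed from BIO tag sequences, here is a minimal sketch. It assumes exact-match, CoNLL-style span scoring; the abstract does not say which scoring script the authors used, and the function names and example tags are hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag is treated leniently as the start of a span.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g_spans, p_spans = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative tags only: one of two gold entities is recovered.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this scoring a predicted entity counts only if both its boundaries and its type match the gold annotation, which is one reason noisy, informal text such as WNUT 2017 tends to score far lower than newswire.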
Code & Models
- 🤗 Amir13/bert-base-parsbert-uncased-wnut2017 · model · 2 dl
- 🤗 Amir13/bert-base-parsbert-uncased-ontonotesv5 · model · 2 dl
- 🤗 Amir13/bert-base-parsbert-uncased-ncbi_disease · model · 2 dl
- 🤗 Amir13/xlm-roberta-base-ncbi_disease · model · 1 dl
- 🤗 Amir13/bert-base-parsbert-uncased-conll2003 · model · 2 dl
- 🤗 Amir13/xlm-roberta-base-wnut2017 · model · 2 dl
- 🤗 Amir13/xlm-roberta-base-conll2003-en · model · 3 dl
- 🤗 Amir13/xlm-roberta-base-ncbi_disease-en · model · 1 dl
- 🤗 Amir13/xlm-roberta-base-conll2003 · model · 2 dl
- 🤗 Amir13/xlm-roberta-base-wnut2017-en · model · 1 dl
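For readers who want to try one of the checkpoints listed above, the following is a minimal sketch using the Hugging Face `transformers` token-classification pipeline. The checkpoint IDs are copied from the list; the assumption that the IDs without the `-en` suffix are the models trained on the translated Persian data, the `aggregation_strategy` setting, and the helper name `load_tagger` are illustrative and not confirmed by the source. Loading a checkpoint needs `transformers` installed and network access.

```python
# Checkpoint IDs copied from the list above. Reading the non "-en" IDs as the
# Persian (translated-data) models is an assumption, not stated in the source.
CONLL2003_CHECKPOINTS = {
    "parsbert": "Amir13/bert-base-parsbert-uncased-conll2003",
    "xlm-roberta": "Amir13/xlm-roberta-base-conll2003",
}


def load_tagger(family: str):
    """Return a token-classification pipeline for the chosen model family.

    Hypothetical helper: the import is done lazily so that this module can be
    inspected without transformers installed.
    """
    if family not in CONLL2003_CHECKPOINTS:
        raise KeyError(f"unknown model family: {family!r}")
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=CONLL2003_CHECKPOINTS[family],
        # Merge word pieces into whole entity spans (an illustrative choice).
        aggregation_strategy="simple",
    )
```

Calling `load_tagger("parsbert")` and passing a Persian sentence to the returned pipeline should yield a list of dictionaries with keys such as `entity_group`, `word`, and `score`; the actual label set depends on how each checkpoint was configured.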
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
