COVID-19 Named Entity Recognition for Vietnamese

Thinh Hung Truong; Mai Hoang Dao; Dat Quoc Nguyen

arXiv:2104.03879 · cs.CL · April 9, 2021

TL;DR

This paper introduces the first manually-annotated COVID-19 NER dataset for Vietnamese, demonstrating that fine-tuning pre-trained language models, especially PhoBERT, significantly improves NER performance in this language.

Contribution

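The paper builds the first manually annotated COVID-19 NER dataset for Vietnamese, with newly defined entity types and more entities than any existing Vietnamese NER dataset, and evaluates strong baselines, including pre-trained language models, along with the effect of automatic word segmentation.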

Findings

01

Pre-trained PhoBERT outperforms XLM-R in Vietnamese NER.

02

Automatic word segmentation improves NER results.

03

The dataset is publicly available for future research.
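
To make the word segmentation finding concrete, here is a minimal sketch of how syllable-level BIO tags could be projected onto word-segmented tokens. It assumes the segmenter joins the syllables of a word with underscores, as common Vietnamese segmenters do. The function name `project_tags_to_words`, the example sentence, and the `LOCATION` label are illustrative and not taken from the paper or its repository.

```python
def project_tags_to_words(syllables, tags, words):
    """Map syllable-level BIO tags onto word-level tokens.

    Each word is assumed to be its syllables joined by "_".
    A word inherits the tag of its first syllable (a simplifying assumption).
    """
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")

    word_tags = []
    i = 0  # index of the next unconsumed syllable
    for word in words:
        parts = word.split("_")
        # Check that the segmentation lines up with the syllable sequence.
        if syllables[i:i + len(parts)] != parts:
            raise ValueError(f"segmentation does not align at word {word!r}")
        word_tags.append(tags[i])
        i += len(parts)

    if i != len(syllables):
        raise ValueError("words do not cover all syllables")
    return word_tags


# Illustrative input: five syllables grouped into three words.
syllables = ["Bệnh", "nhân", "ở", "Hà", "Nội"]
tags = ["O", "O", "O", "B-LOCATION", "I-LOCATION"]
words = ["Bệnh_nhân", "ở", "Hà_Nội"]

word_tags = project_tags_to_words(syllables, tags, words)
print(list(zip(words, word_tags)))
```

After projection a multi-syllable entity such as "Hà Nội" becomes a single token with a single tag, which shortens the sequence the tagger has to label. Taking the first syllable's tag is only safe when entity boundaries coincide with word boundaries; a real pipeline would need to handle mismatches explicitly.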

Abstract

The current COVID-19 pandemic has led to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually-annotated COVID-19 domain-specific dataset for Vietnamese. Particularly, our dataset is annotated for the named entity recognition (NER) task with newly-defined entity types that can be used in other future epidemics. Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets. We empirically conduct experiments using strong baselines on our dataset, and find that: automatic Vietnamese word segmentation helps improve the NER results and the highest performances are obtained…

Code & Models

Repositories

VinAIResearch/PhoNER_COVID19
Official

Taxonomy

Methods: XLM-R