TL;DR
This paper introduces the first manually annotated COVID-19 NER dataset for Vietnamese and shows that fine-tuning pre-trained language models improves NER performance on it, with PhoBERT performing better than XLM-R.
Contribution
The paper creates the first COVID-19 NER dataset for Vietnamese, with newly defined entity types and more entities than any existing Vietnamese NER dataset, and evaluates pre-trained language models with and without automatic word segmentation.
Findings
Pre-trained PhoBERT outperforms XLM-R in Vietnamese NER.
Automatic word segmentation improves NER results.
The dataset is publicly available for future research.
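To make the word-segmentation finding concrete, here is a minimal sketch of how syllable-level BIO tags could be projected onto word-segmented tokens. It assumes the common Vietnamese convention of joining the syllables of one word with an underscore. The function name `to_word_level`, the example sentence, and the `LOCATION` tag are illustrative assumptions, not taken from the paper or its released data.

```python
from typing import List


def to_word_level(syllables: List[str], tags: List[str], words: List[str]) -> List[str]:
    """Project syllable-level BIO tags onto word-segmented tokens.

    Assumption: each item in `words` is one or more syllables joined by "_",
    and each word takes the tag of its first syllable.
    """
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")

    word_tags = []
    i = 0  # index of the next unconsumed syllable
    for word in words:
        parts = word.split("_")
        # Check that the segmentation really covers the same syllables.
        if syllables[i:i + len(parts)] != parts:
            raise ValueError(f"word {word!r} does not align with the syllables")
        word_tags.append(tags[i])
        i += len(parts)

    if i != len(syllables):
        raise ValueError("segmentation does not cover every syllable")
    return word_tags


# Hypothetical example: "Bệnh nhân ở Hà Nội" ("the patient is in Hanoi").
syllables = ["Bệnh", "nhân", "ở", "Hà", "Nội"]
syllable_tags = ["O", "O", "O", "B-LOCATION", "I-LOCATION"]
words = ["Bệnh_nhân", "ở", "Hà_Nội"]

word_tags = to_word_level(syllables, syllable_tags, words)
for word, tag in zip(words, word_tags):
    print(word, tag)
```

In this sketch the two-syllable entity becomes a single token with a single tag, so a tagger working on segmented input has fewer entity boundaries to predict. That is one plausible reason segmentation could help, offered here as an interpretation rather than the paper's own explanation.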
Abstract
The current COVID-19 pandemic has led to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually-annotated COVID-19 domain-specific dataset for Vietnamese. Particularly, our dataset is annotated for the named entity recognition (NER) task with newly-defined entity types that can be used in other future epidemics. Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets. We empirically conduct experiments using strong baselines on our dataset, and find that: automatic Vietnamese word segmentation helps improve the NER results and the highest performances are obtained…
Taxonomy
Methods: XLM-R
