Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation
Long Phan, Tai Dang, Hieu Tran, Trieu H. Trinh, Vy Phan, Lam D. Chau, and Minh-Thang Luong

TL;DR
This paper leverages large-scale translation to create biomedical NLP resources in Vietnamese, training a new model that achieves state-of-the-art results on key benchmarks and introducing a new Vietnamese biomedical NLP task.
Contribution
It introduces ViPubmedT5, an encoder-decoder Transformer pretrained on 20 million translated PubMed abstracts, and presents ViMedNLI, a new Vietnamese biomedical NLP dataset and task translated from MedNLI.
Findings
ViPubmedT5 achieves state-of-the-art results on two biomedical benchmarks, covering summarization and acronym disambiguation.
Large-scale translation effectively enriches low-resource language biomedical data.
ViMedNLI, translated from MedNLI and refined by human experts, provides a new benchmark for Vietnamese biomedical NLP.
Abstract
Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretraining and supervised data in the biomedical domain. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI, a new NLP task in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts, with evaluations of existing methods against ViPubmedT5.
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Natural Language Processing Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam · Dense Connections · Softmax · Label Smoothing
