Data Augmentation for Biomedical Factoid Question Answering
Dimitris Pappas, Prodromos Malakasiotis, Ion Androutsopoulos

TL;DR
This paper evaluates seven data augmentation techniques for biomedical factoid question answering, demonstrating significant performance improvements, especially with simple word2vec-based substitution, and discusses their effectiveness with large pre-trained models.
Contribution
It systematically compares various data augmentation methods in biomedical QA, highlighting the effectiveness of simple word2vec-based substitution and providing resources for future research.
Findings
Word2vec-based substitution outperformed other methods.
Data augmentation significantly improves biomedical QA performance.
Simple augmentation techniques are highly effective with large models.
Abstract
We study the effect of seven data augmentation (DA) methods in factoid question answering, focusing on the biomedical domain, where obtaining training instances is particularly difficult. We experiment with data from the BioASQ challenge, which we augment with training instances obtained from an artificial biomedical machine reading comprehension dataset, or via back-translation, information retrieval, word substitution based on word2vec embeddings or masked language modeling, question generation, or extending the given passage with additional context. We show that DA can lead to very significant performance gains, even when using large pre-trained Transformers, contributing to a broader discussion of if/when DA benefits large pre-trained models. One of the simplest DA methods, word2vec-based word substitution, performed best and is recommended. We release our artificial training…
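To make the idea of word2vec-based word substitution concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy vectors, the similarity threshold `min_sim`, the `protected` set (used here to leave the answer tokens untouched), and the function names are all assumptions made for illustration. In practice the vectors would come from embeddings trained on biomedical text, for example loaded through a library such as gensim.

```python
import math
import random

# Toy 3-dimensional "embeddings", invented purely for illustration.
# Real word2vec vectors would be trained on a large biomedical corpus.
TOY_VECTORS = {
    "tumor": [0.9, 0.1, 0.0],
    "neoplasm": [0.85, 0.15, 0.05],
    "protein": [0.1, 0.9, 0.1],
    "enzyme": [0.15, 0.85, 0.2],
    "causes": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(word, vectors, min_sim=0.95):
    """Return the most similar other word, or None if none reaches min_sim.

    min_sim is a hypothetical threshold, not a value taken from the paper.
    """
    if word not in vectors:
        return None
    best, best_sim = None, min_sim
    for other, vec in vectors.items():
        if other == word:
            continue
        sim = cosine(vectors[word], vec)
        if sim >= best_sim:
            best, best_sim = other, sim
    return best


def augment(tokens, vectors, protected=(), rate=1.0, rng=None):
    """Create a new instance by swapping tokens for their nearest neighbors.

    Tokens in `protected` (assumed here to be the answer span) are never
    replaced, so the augmented instance keeps the same gold answer.
    """
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        sub = None
        if tok not in protected and rng.random() < rate:
            sub = nearest_neighbor(tok.lower(), vectors)
        out.append(sub if sub is not None else tok)
    return out


question = ["which", "protein", "causes", "the", "tumor"]
print(augment(question, TOY_VECTORS, protected={"tumor"}))
# → ['which', 'enzyme', 'causes', 'the', 'tumor']
```

Each call yields one extra training instance from an existing one. Lowering `rate` replaces only a random subset of tokens, so several distinct variants can be produced from the same question or passage.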
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
