An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering
Shayne Longpre, Yi Lu, Zhucheng Tu, Chris DuBois

TL;DR
This paper investigates data augmentation and sampling strategies, including negative sampling and back-translation paraphrasing, to improve domain-agnostic question answering models; its submission placed second in Exact Match and F1 on the MRQA 2019 Shared Task leaderboard.
Contribution
It shows that data sampling and augmentation techniques, notably negative sampling, can improve domain-agnostic question answering models built on pre-trained language models.
Findings
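A simple negative sampling technique is particularly effective, even though it is typically used for datasets that include unanswerable questions, such as SQuAD 2.0.
Negative sampling was applied in conjunction with per-domain sampling in the final submission.
The XLNet-based submission achieved the second-best Exact Match and F1 on the MRQA leaderboard.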
Abstract
To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, as well as query and context paraphrases generated by back-translation. We find a simple negative sampling technique to be particularly effective, even though it is typically used for datasets that include unanswerable questions, such as SQuAD 2.0. When applied in conjunction with per-domain sampling, our XLNet (Yang et al., 2019)-based submission achieved the second best Exact Match and F1 in the MRQA leaderboard competition.
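The abstract names per-domain sampling and negative sampling without giving their details, so the following is only a minimal sketch of how the two ideas might fit together, not the authors' implementation. It assumes that each example is a dict with hypothetical `domain`, `context`, and `answer` fields, that contexts are split into fixed-size word windows, and that windows lacking the answer text serve as "no answer" negatives. The function names, window sizes, and the `neg_ratio` parameter are all illustrative assumptions.

```python
import random
from collections import defaultdict


def per_domain_sample(examples, n_per_domain, seed=0):
    """Draw up to n_per_domain examples from each domain (assumed scheme)."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)
    sampled = []
    for domain in sorted(by_domain):
        pool = by_domain[domain]
        # Small domains contribute everything they have.
        sampled.extend(rng.sample(pool, min(n_per_domain, len(pool))))
    return sampled


def make_chunks(context, answer, chunk_size, stride):
    """Split a context into word windows and flag those containing the answer."""
    words = context.split()
    chunks = []
    start = 0
    while True:
        text = " ".join(words[start:start + chunk_size])
        chunks.append({"text": text, "has_answer": answer in text})
        if start + chunk_size >= len(words):
            break
        start += stride
    return chunks


def negative_sample(chunks, neg_ratio, seed=0):
    """Keep all answer-bearing chunks plus a sampled subset of negatives.

    neg_ratio is a hypothetical knob: negatives kept per positive chunk.
    """
    rng = random.Random(seed)
    positives = [c for c in chunks if c["has_answer"]]
    negatives = [c for c in chunks if not c["has_answer"]]
    k = min(len(negatives), int(round(neg_ratio * len(positives))))
    return positives + rng.sample(negatives, k)
```

In this sketch, a training set would be built by calling `per_domain_sample` on the pooled examples, chunking each selected context with `make_chunks`, and passing the chunks through `negative_sample`, so that the model also sees windows it should label as unanswerable.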
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Residual Connection · Linear Warmup With Linear Decay · Byte Pair Encoding · SentencePiece · Dense Connections · Adam · Softmax · Dropout
