Build a Robust QA System with Transformer-based Mixture of Experts
Yu Qing Zhou, Xixuan Julie Liu, Yuanzhe Dong

TL;DR
This paper presents a robust question answering system that combines a transformer-based Mixture-of-Experts architecture with data augmentation, achieving a 9.52% out-of-domain F1 gain over the baseline.
Contribution
It introduces a novel MoE-based QA model integrated into DistilBERT with simplified routing and demonstrates enhanced robustness through data augmentation.
Findings
Achieved an out-of-domain F1 score of 53.477, a 9.52% gain over the baseline.
Demonstrated the effectiveness of MoE architecture in robust QA tasks.
Reported 59.506 F1 and 41.651 EM on the final test set.
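For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how exact match (EM) and token-level F1 are commonly computed for extractive QA. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the summary does not state which evaluation script the authors used, and the function names `normalize`, `exact_match`, and `f1_score` are illustrative only.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are typically the mean of these per-question values (taking the best score over all reference answers), scaled to 0-100. This is why F1 is usually higher than EM: a partially overlapping span earns F1 credit but no EM credit.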
Abstract
In this paper, we aim to build a robust question answering system that can adapt to out-of-domain datasets. A single network may overfit to superficial correlations in the training distribution. With a meaningful number of expert sub-networks, a gating network that selects a sparse combination of experts for each input, and careful balancing of the importance of the expert sub-networks, the Mixture-of-Experts (MoE) model allows us to train a multi-task learner that generalizes to out-of-domain datasets. We also explore moving the MoE layers up to the middle of DistilBERT and replacing the dense feed-forward network with sparsely activated Switch FFN layers, similar to the Switch Transformer architecture, which simplifies the MoE routing algorithm and reduces communication and computational costs. In addition to model architectures, we explore…
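The routing idea in the abstract can be illustrated with a small, dependency-free sketch of a switch-style FFN layer: a router scores each token, the single highest-scoring expert processes it, and the expert output is scaled by the router probability. This is not the authors' implementation. The class name `SwitchFFN`, the layer sizes, the random initialization, and the auxiliary load-balancing term (written in the form used by the Switch Transformer, N · Σ fᵢ · Pᵢ) are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SwitchFFN:
    """Hypothetical top-1 routed feed-forward layer (illustration only)."""

    def __init__(self, d_model, d_ff, n_experts, seed=0):
        rng = random.Random(seed)
        self.n_experts = n_experts
        # Router: one row of weights per expert.
        self.router = rand_matrix(n_experts, d_model, rng)
        # Each expert is a two-layer ReLU FFN: (W_in, W_out).
        self.experts = [
            (rand_matrix(d_ff, d_model, rng), rand_matrix(d_model, d_ff, rng))
            for _ in range(n_experts)
        ]

    def forward(self, tokens):
        outputs = []
        counts = [0] * self.n_experts
        prob_sums = [0.0] * self.n_experts
        for x in tokens:
            probs = softmax(matvec(self.router, x))
            # Top-1 routing: only the best-scoring expert runs for this token.
            k = max(range(self.n_experts), key=probs.__getitem__)
            w_in, w_out = self.experts[k]
            hidden = [max(0.0, h) for h in matvec(w_in, x)]
            y = matvec(w_out, hidden)
            # Scale by the gate value so the router would receive gradient
            # in a real autodiff framework.
            outputs.append([probs[k] * v for v in y])
            counts[k] += 1
            for i, p in enumerate(probs):
                prob_sums[i] += p
        n = len(tokens)
        frac = [c / n for c in counts]          # f_i: share of tokens per expert
        mean_prob = [s / n for s in prob_sums]  # P_i: mean router probability
        # Assumed Switch-Transformer-style balance term; smaller when f and P
        # are spread evenly across experts.
        aux = self.n_experts * sum(f * p for f, p in zip(frac, mean_prob))
        return outputs, frac, aux


# Usage: route 16 random 8-dimensional "tokens" through 4 experts.
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
layer = SwitchFFN(d_model=8, d_ff=16, n_experts=4)
outputs, frac, aux = layer.forward(tokens)
```

Because each token activates only one expert, the per-token compute stays close to that of a single dense FFN while the total parameter count grows with the number of experts. A real model would implement this with batched tensor operations and add the balance term, multiplied by a small coefficient, to the training loss.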
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Linear Warmup With Linear Decay · Layer Normalization · Dropout · WordPiece · Weight Decay · Label Smoothing
