Synthetic Target Domain Supervision for Open Retrieval QA
Revanth Gangi Reddy, Bhavani Iyer, Md Arafat Sultan, Rong Zhang, Avirup Sil, Vittorio Castelli, Radu Florian, Salim Roukos

TL;DR
This paper improves neural passage retrieval for open retrieval question answering in specialized domains by fine-tuning on synthetic training data, which makes the retriever robust enough to outperform BM25 in out-of-domain settings.
Contribution
It introduces a synthetic supervision approach for fine-tuning DPR, significantly boosting its performance in domain-specific open retrieval QA tasks.
Findings
Without target domain fine-tuning, DPR underperforms BM25 on closed, specialized domains such as COVID-19.
Synthetic training data improves DPR's robustness and out-of-domain performance.
Ensembling the fine-tuned DPR with BM25 yields the best results, pushing the state of the art on multiple out-of-domain datasets.
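The ensembling finding can be illustrated with a minimal sketch of score-level fusion. The summary does not say how the BM25 and DPR outputs are combined, so everything here is an assumption made for illustration: the min-max normalization, the linear interpolation, the `weight` parameter, and the function names `min_max_normalize` and `ensemble_rank` are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def ensemble_rank(
    bm25_scores: Dict[str, float],
    dpr_scores: Dict[str, float],
    weight: float = 0.5,  # assumed interpolation weight on BM25, not from the paper
) -> List[Tuple[str, float]]:
    """Fuse two retrievers' scores for one question and rank the passages.

    BM25 and dense dot-product scores live on different scales, so each
    list is normalized before interpolation. A passage returned by only
    one retriever contributes 0.0 for the other.
    """
    bm25_n = min_max_normalize(bm25_scores)
    dpr_n = min_max_normalize(dpr_scores)
    fused = {
        pid: weight * bm25_n.get(pid, 0.0) + (1.0 - weight) * dpr_n.get(pid, 0.0)
        for pid in set(bm25_n) | set(dpr_n)
    }
    # Highest fused score first; ties broken by passage id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores for a single question (made-up numbers, not results from the paper).
bm25 = {"p1": 12.0, "p2": 8.0, "p3": 4.0}
dpr = {"p2": 80.0, "p3": 70.0, "p4": 60.0}
ranked = ensemble_rank(bm25, dpr)
```

In this toy case a passage that both retrievers score well (`p2`) rises above one that only BM25 ranks first (`p1`), which is the intuition behind combining a lexical and a dense retriever. In practice the weight would be tuned on held-out data.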
Abstract
Neural passage retrieval is a new and promising approach in open retrieval question answering. In this work, we stress-test the Dense Passage Retriever (DPR) -- a state-of-the-art (SOTA) open domain neural retrieval model -- on closed and specialized target domains such as COVID-19, and find that it lags behind standard BM25 in this important real-world setting. To make DPR more robust under domain shift, we explore its fine-tuning with synthetic training examples, which we generate from unlabeled target domain text using a text-to-text generator. In our experiments, this noisy but fully automated target domain supervision gives DPR a sizable advantage over BM25 in out-of-domain settings, making it a more viable model in practice. Finally, an ensemble of BM25 and our improved DPR model yields the best results, further pushing the SOTA for open retrieval QA on multiple out-of-domain test…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
