Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains
Alon Albalak, Sharon Levy, and William Yang Wang

TL;DR
This paper presents a cross-lingual open-retrieval question answering system for emergent domains like COVID-19, leveraging automatic translation and deep semantic retrieval to improve performance in low-resource multilingual settings.
Contribution
The paper introduces a method that uses automatic translation, alignment, and filtering to create English-to-all training datasets, which improve cross-lingual retrieval in emergent domains.
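The summary does not spell out how the translation, alignment, and filtering steps are carried out, so the following is only a minimal sketch of one plausible reading: translate each English passage and answer into a target language, align the answer by locating it inside the translated passage, and filter out examples where that fails. The question stays in English, giving English-to-target pairs. The names `build_english_to_all`, `toy_translate`, and `TOY_LEXICON` are hypothetical, and the word-lookup translator is a stand-in for a real machine-translation system.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon standing in for a real machine-translation system.
TOY_LEXICON: Dict[str, Dict[str, str]] = {
    "es": {"masks": "mascarillas", "reduce": "reducen", "transmission": "transmisión"},
}


def toy_translate(text: str, lang: str) -> str:
    """Word-by-word lookup; words missing from the lexicon are left unchanged."""
    table = TOY_LEXICON.get(lang, {})
    return " ".join(table.get(word, word) for word in text.split())


def build_english_to_all(
    examples: List[dict],
    translate: Callable[[str, str], str],
    target_langs: List[str],
) -> List[dict]:
    """Pair each English question with a translated passage and answer.

    Assumption: alignment is approximated by an exact substring search,
    and filtering drops any example whose translated answer is not found.
    """
    dataset = []
    for ex in examples:
        for lang in target_langs:
            passage_t = translate(ex["passage"], lang)  # translation step
            answer_t = translate(ex["answer"], lang)
            start = passage_t.find(answer_t)            # naive alignment step
            if start == -1:
                continue                                # filtering step
            dataset.append({
                "lang": lang,
                "question": ex["question"],             # question stays in English
                "passage": passage_t,
                "answer": answer_t,
                "answer_start": start,
            })
    return dataset


examples = [
    {"question": "What do masks reduce?",
     "passage": "masks reduce transmission", "answer": "transmission"},
    # "Masks" is not in the toy lexicon, so alignment fails and the example is filtered.
    {"question": "What reduces transmission?",
     "passage": "masks reduce transmission", "answer": "Masks"},
]
english_to_es = build_english_to_all(examples, toy_translate, ["es"])
```

A real pipeline would likely need a softer alignment than exact substring matching, since translated answers often differ in inflection or word order from their occurrence in the translated passage.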
Findings
A deep semantic retriever outperforms a BM25 baseline in cross-lingual retrieval.
Training on English-to-all data significantly improves retrieval performance.
System code is publicly released for reproducibility and further research.
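One reason a lexical baseline can struggle across languages is that BM25 only rewards exact term overlap. The sketch below is a small, self-contained BM25 scorer (using the common non-negative IDF variant, with whitespace tokenization as a simplifying assumption); it is not the paper's baseline implementation, and the function name `bm25_scores` and the example sentences are illustrative only.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with BM25.

    Assumptions: lowercase whitespace tokenization and a non-empty corpus.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "masks reduce transmission",               # English document
    "las mascarillas reducen la transmisión",  # Spanish document, same meaning
]
scores = bm25_scores("masks transmission", docs)
```

Here the English query shares no tokens with the Spanish document, so its BM25 score is exactly zero even though the two documents say the same thing. A retriever that compares learned sentence representations is not bound by token overlap in this way.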
Abstract
Open-retrieval question answering systems are generally trained and tested on large datasets in well-established domains. However, low-resource settings such as new and emerging domains would especially benefit from reliable question answering systems. Furthermore, multilingual and cross-lingual resources in emergent domains are scarce, leading to few or no such systems. In this paper, we demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19. Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable. To address the scarcity of cross-lingual training data in emergent domains, we present a method utilizing automatic translation, alignment, and filtering to produce English-to-all datasets. We show that a deep semantic retriever greatly benefits from training on our English-to-all data and…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
