WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval
Michael Dinzinger, Laura Caspari, Kanishka Ghosh Dastidar, Jelena Mitrović, Michael Granitzer

TL;DR
WebFAQ is a large multilingual QA dataset derived from schema.org FAQ annotations. It supports training better dense retrieval models and building high-quality bilingual corpora for more than 1,000 language pairs.
Contribution
The paper introduces WebFAQ, a large-scale, high-quality multilingual QA dataset with accompanying retrieval benchmarks, along with a method for generating high-quality bilingual corpora using automated bitext mining.
Findings
Fine-tuning XLM-RoBERTa on WebFAQ improves retrieval performance.
WebFAQ's datasets generalize well to other multilingual retrieval benchmarks.
High-quality bilingual corpora are created for over 1000 language pairs.
Abstract
We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multilingual dense retrieval models. To empirically confirm WebFAQ's efficacy, we use the collected QAs to fine-tune an in-domain pretrained XLM-RoBERTa model. Through this process of dataset-specific fine-tuning, the model achieves significant retrieval performance gains, which generalize - beyond…
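To make the data source concrete, here is a minimal sketch of how question-answer pairs can be read out of a schema.org FAQPage annotation embedded as JSON-LD. It relies only on the public schema.org vocabulary (FAQPage, mainEntity, Question, name, acceptedAnswer, text). The function name extract_qa_pairs and the sample markup are hypothetical, and this is not the authors' extraction pipeline, which this summary does not describe.

```python
import json


def extract_qa_pairs(jsonld_text):
    """Return (question, answer) tuples from a schema.org FAQPage JSON-LD string.

    Hypothetical helper for illustration; not taken from the WebFAQ codebase.
    """
    data = json.loads(jsonld_text)
    # A JSON-LD script may contain a single object or a list of objects.
    items = data if isinstance(data, list) else [data]
    pairs = []
    for item in items:
        if not isinstance(item, dict) or item.get("@type") != "FAQPage":
            continue
        entities = item.get("mainEntity", [])
        # mainEntity may be a single Question or a list of Questions.
        if isinstance(entities, dict):
            entities = [entities]
        for question in entities:
            if not isinstance(question, dict) or question.get("@type") != "Question":
                continue
            answer = question.get("acceptedAnswer") or {}
            if isinstance(answer, list):
                answer = answer[0] if answer else {}
            question_text = (question.get("name") or "").strip()
            answer_text = (answer.get("text") or "").strip()
            # Keep only pairs where both sides are non-empty.
            if question_text and answer_text:
                pairs.append((question_text, answer_text))
    return pairs


# Made-up sample markup, for illustration only.
EXAMPLE = """
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How long does shipping take?",
      "acceptedAnswer": {"@type": "Answer", "text": "Usually 3 to 5 business days."}
    },
    {
      "@type": "Question",
      "name": "Can I return an item?",
      "acceptedAnswer": {"@type": "Answer", "text": ""}
    }
  ]
}
"""

if __name__ == "__main__":
    for question_text, answer_text in extract_qa_pairs(EXAMPLE):
        print(question_text, "->", answer_text)
```

The second sample question is dropped because its answer text is empty. A real pipeline would add the filtering and near-duplicate detection steps mentioned in the abstract, which are not sketched here.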
Taxonomy
Methods: Sparse Evolutionary Training
