WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval

Michael Dinzinger; Laura Caspari; Kanishka Ghosh Dastidar; Jelena Mitrović; Michael Granitzer

arXiv:2502.20936·cs.CL·March 3, 2025



TL;DR

WebFAQ is a collection of 96 million natural QA pairs across 75 languages, derived from FAQ-style schema.org annotations. It supports training multilingual dense retrieval models and mining bilingual corpora across more than 1000 language pairs.

Contribution

The paper introduces WebFAQ, a large-scale multilingual QA dataset, together with 20 monolingual retrieval benchmarks and a method for generating bilingual corpora through automated bitext mining.

Findings

1. Fine-tuning XLM-RoBERTa on WebFAQ improves retrieval performance.

2. WebFAQ's datasets generalize well to other multilingual retrieval benchmarks.

3. High-quality bilingual corpora are created for over 1000 language pairs.

Abstract

We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multilingual dense retrieval models. To empirically confirm WebFAQ's efficacy, we use the collected QAs to fine-tune an in-domain pretrained XLM-RoBERTa model. Through this process of dataset-specific fine-tuning, the model achieves significant retrieval performance gains, which generalize - beyond…
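To make the data source concrete, here is a minimal sketch of how question-answer pairs can be read from a schema.org `FAQPage` annotation embedded as JSON-LD. It assumes the standard `FAQPage` / `Question` / `acceptedAnswer` layout. The function name `extract_qa_pairs` and the sample markup are hypothetical, and this is not the authors' extraction pipeline, which is not described on this page.

```python
import json


def extract_qa_pairs(jsonld_text):
    """Return (question, answer) tuples found in FAQPage JSON-LD markup.

    Hypothetical helper for illustration; real-world markup is messier
    (nested @graph nodes, HTML inside answers, microdata instead of JSON-LD).
    """
    data = json.loads(jsonld_text)
    # A JSON-LD block may hold a single object or a list of objects.
    nodes = data if isinstance(data, list) else [data]
    pairs = []
    for node in nodes:
        if not isinstance(node, dict) or node.get("@type") != "FAQPage":
            continue
        entities = node.get("mainEntity", [])
        if isinstance(entities, dict):
            entities = [entities]
        for question in entities:
            if not isinstance(question, dict) or question.get("@type") != "Question":
                continue
            answer = question.get("acceptedAnswer") or {}
            if isinstance(answer, list):
                answer = answer[0] if answer else {}
            if not isinstance(answer, dict):
                continue
            question_text = (question.get("name") or "").strip()
            answer_text = (answer.get("text") or "").strip()
            # Keep only pairs where both sides are non-empty.
            if question_text and answer_text:
                pairs.append((question_text, answer_text))
    return pairs


# Made-up sample markup: one complete pair and one with an empty answer.
sample = json.dumps({
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is WebFAQ?",
            "acceptedAnswer": {"@type": "Answer", "text": "A collection of natural QA pairs."},
        },
        {
            "@type": "Question",
            "name": "Is this answered?",
            "acceptedAnswer": {"@type": "Answer", "text": "   "},
        },
    ],
})

pairs = extract_qa_pairs(sample)
print(pairs)
```

The sketch drops pairs with an empty question or answer, which is only the simplest possible filter; the refined filtering and near-duplicate detection mentioned in the abstract would be separate steps.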


Code & Models

Models

PaDaS-Lab/xlm-roberta-base-msmarco-webfaq



Taxonomy

Methods: Sparse Evolutionary Training