WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval

Michael Dinzinger; Laura Caspari; Ali Salman; Irvin Topi; Jelena Mitrović; Michael Granitzer

arXiv:2602.17327·cs.IR·February 20, 2026

TL;DR

WebFAQ 2.0 is a large, multilingual FAQ dataset with mined hard negatives designed to improve dense retrieval models, enabling more diverse and effective training for cross-lingual question answering.

Contribution

WebFAQ 2.0 is presented as the largest FAQ-based multilingual QA resource. It is built with a new data collection strategy that directly crawls and extracts web content, and it is released together with a hard negatives dataset for training dense retrievers.

Findings

01. Enhanced multilingual coverage with 198M QA pairs

02. Effective training of dense retrievers using mined hard negatives

03. Openly available datasets and scripts for community use

Abstract

We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and the number of bilingual aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset with richer context through page titles and descriptions. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this…
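
The abstract mentions a two-stage retrieval pipeline that yields cross-encoder scores for 200 negatives per query. The excerpt does not describe the models or selection rules used, so the following is only a minimal sketch of the general technique under stated assumptions: stage one ranks candidates by cosine similarity of precomputed embeddings, and stage two attaches a score from a stronger reranker. The names `mine_hard_negatives`, `cosine`, and `rescore` are hypothetical, and `rescore` stands in for a cross-encoder.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mine_hard_negatives(
    query_vec: Sequence[float],
    positive_id: str,
    corpus_vecs: Dict[str, Sequence[float]],
    rescore: Callable[[str], float],
    k: int = 200,  # 200 negatives per query, as stated in the abstract
) -> List[Tuple[str, float]]:
    """Return up to k (doc_id, reranker_score) pairs for one query.

    Assumption: embeddings are precomputed, and `rescore` is a
    placeholder for a cross-encoder scoring the (query, doc) pair.
    """
    # Stage 1: rank every document except the known positive by
    # embedding similarity and keep the top k as candidates.
    candidates = sorted(
        (doc_id for doc_id in corpus_vecs if doc_id != positive_id),
        key=lambda doc_id: cosine(query_vec, corpus_vecs[doc_id]),
        reverse=True,
    )[:k]
    # Stage 2: attach a reranker score to each candidate so that
    # downstream training can filter or weight the negatives.
    return [(doc_id, rescore(doc_id)) for doc_id in candidates]
```

Keeping the reranker score next to each negative, rather than only the ranked list, leaves the choice of filtering threshold or distillation target to whoever trains on the data.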

Taxonomy

Topics: Topic Modeling · Information Retrieval and Search Behavior · Expert finding and Q&A systems