PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel

TL;DR
This paper introduces PAQ, a dataset of 65 million automatically generated QA pairs, and RePAQ, a QA-pair retriever that uses PAQ to match the accuracy of recent retrieve-and-read models while running faster.
Contribution
The paper presents PAQ, a large-scale resource of automatically generated QA pairs, and RePAQ, a QA-pair retriever built to complement it, with the aim of making open-domain question answering both accurate and efficient.
Findings
RePAQ matches recent retrieve-and-read models in accuracy while being faster.
PAQ enables training of CBQA models that outperform comparable baselines by 5%.
RePAQ can be configured for size and speed, maintaining high accuracy.
Abstract
Open-domain Question Answering models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models lack the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically-generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent…
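The QA-pair retrieval idea described in the abstract can be illustrated with a toy sketch: store question-answer pairs, find the stored question most similar to the query, and return its answer together with the matched question. This is not RePAQ itself. The bag-of-words `embed` function is a stand-in assumption for a learned question encoder, and the class and method names (`QAPairRetriever`, `add`, `answer`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a learned question encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class QAPairRetriever:
    """Hypothetical minimal QA-pair retriever over an in-memory list."""

    def __init__(self, qa_pairs=()):
        self.entries = []  # (question, answer, question vector)
        for question, answer in qa_pairs:
            self.add(question, answer)

    def add(self, question, answer):
        # New knowledge is added by appending a pair; nothing is retrained.
        self.entries.append((question, answer, embed(question)))

    def answer(self, query):
        # Return the matched question alongside the answer so the
        # prediction can be inspected.
        if not self.entries:
            return None
        query_vec = embed(query)
        question, answer, _ = max(
            self.entries, key=lambda entry: cosine(query_vec, entry[2])
        )
        return question, answer


retriever = QAPairRetriever([
    ("who wrote the origin of species", "Charles Darwin"),
    ("what is the capital of france", "Paris"),
])
match = retriever.answer("Who was the author of On the Origin of Species?")
print(match)

# Updating at test time: add one pair, then query it immediately.
retriever.add("what is the tallest mountain on earth", "Mount Everest")
updated = retriever.answer("tallest mountain on earth")
print(updated)
```

Returning the matched question is what makes the answer inspectable, and `add` shows why updating such a system is cheap. A linear scan like this would not be practical at the scale of 65M pairs; it only illustrates the lookup logic.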
Code & Models
- 🤗 sentence-transformers/all-MiniLM-L6-v2 · model · 200.9M dl · ♡ 4639
- 🤗 sentence-transformers/all-mpnet-base-v2 · model · 28.7M dl · ♡ 1266
- 🤗 Hum-Works/lodestone-base-4096-v1 · model · 112 dl · ♡ 12
- 🤗 arredondos/my_sentence_transformer · model · 1 dl
- 🤗 flax-sentence-embeddings/all_datasets_v3_MiniLM-L12 · model · 5 dl · ♡ 2
- 🤗 flax-sentence-embeddings/all_datasets_v3_MiniLM-L6 · model · 3 dl
- 🤗 flax-sentence-embeddings/all_datasets_v3_distilroberta-base · model · 1 dl · ♡ 2
- 🤗 flax-sentence-embeddings/all_datasets_v3_mpnet-base · model · 596 dl · ♡ 13
- 🤗 flax-sentence-embeddings/all_datasets_v3_roberta-large · model · 24 dl · ♡ 13
- 🤗 flax-sentence-embeddings/all_datasets_v4_MiniLM-L12 · model · 2 dl · ♡ 2
