PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, Sebastian Riedel

arXiv:2102.07033·cs.CL·February 16, 2021

1 repository · 10 models · 1 dataset

TL;DR

This paper introduces PAQ, a massive dataset of 65 million automatically generated QA pairs, and RePAQ, a new retrieval system that improves open-domain question answering speed and accuracy by leveraging this dataset.

Contribution

The paper presents PAQ, a large-scale QA pair resource, and RePAQ, a retrieval model that enhances accuracy and efficiency in open-domain question answering systems.

Findings

1. RePAQ matches recent retrieve-and-read models in accuracy while being faster.
2. PAQ enables training of CBQA models that outperform baselines by 5%.
3. RePAQ can be configured for size and speed, maintaining high accuracy.

Abstract

Open-domain Question Answering models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models lack the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically-generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent…
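As a rough illustration of the QA-pair retrieval idea the abstract describes, the sketch below stores question-answer pairs, answers a new question by returning the answer attached to the most similar stored question, and shows how new knowledge can be added at test time by appending a pair. This is not RePAQ: the names `QAPairRetriever`, `embed`, and `cosine` are hypothetical, and bag-of-words cosine similarity is an assumed stand-in for a learned question encoder.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class QAPairRetriever:
    """Hypothetical minimal QA-pair store with nearest-question lookup."""

    def __init__(self):
        self.pairs = []  # each entry: (question, answer, question_vector)

    def add(self, question, answer):
        # Updating the system's knowledge is just appending a new pair.
        self.pairs.append((question, answer, embed(question)))

    def answer(self, query):
        # Return (answer, matched_question, score), or None if the store is empty.
        if not self.pairs:
            return None
        query_vec = embed(query)
        question, answer, vec = max(
            self.pairs, key=lambda pair: cosine(query_vec, pair[2])
        )
        return answer, question, cosine(query_vec, vec)


retriever = QAPairRetriever()
retriever.add("who wrote the origin of species", "Charles Darwin")
retriever.add("what is the capital of france", "Paris")

predicted, matched_question, score = retriever.answer(
    "Who wrote On the Origin of Species?"
)
print(predicted, "| matched:", matched_question, "| score:", round(score, 3))
```

Returning the matched question alongside the answer is what makes this style of system inspectable: a reader can check whether the stored question really means the same thing as the query, and a low similarity score can be treated as a signal not to answer.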

Code & Models

Repositories

facebookresearch/PAQ
PyTorch · Official

Models

Datasets

embedding-data/PAQ_pairs
dataset · 172 downloads
