TL;DR
This paper examines how much of the performance on open-domain QA benchmarks reflects memorization rather than generalization. It finds substantial overlap between training and test data and shows that models perform far worse on questions they could not have memorized from the training set.
Contribution
The study provides a detailed analysis of test set overlaps in popular QA datasets and evaluates model performance differences between memorized and novel questions.
Findings
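60-70% of test-time answers also appear somewhere in the training sets
30% of test-set questions have a near-duplicate paraphrase in the corresponding training set
Models score far lower on questions that cannot be memorized from training data, with a mean absolute performance difference of 63% between repeated and non-repeated data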
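To make the answer-overlap finding concrete, here is a minimal Python sketch of how such a fraction could be computed. It assumes a test question counts as overlapping when at least one of its reference answers, after simple normalization, matches an answer of any training question. The function names, the normalization rules, and the toy data are illustrative assumptions, not the authors' released code. Question overlap (near-duplicate paraphrases) is not covered here, since it generally needs paraphrase judgments rather than string matching.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: a common QA-style answer normalization, not necessarily the
    exact rules used in the paper.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_overlap_fraction(train, test):
    """Fraction of test items with at least one answer seen in training.

    Both arguments are lists of (question, [reference answers]) pairs.
    """
    train_answers = {normalize(a) for _, answers in train for a in answers}
    hits = sum(
        any(normalize(a) in train_answers for a in answers)
        for _, answers in test
    )
    return hits / len(test) if test else 0.0


# Hypothetical toy data, for illustration only.
train = [
    ("who wrote hamlet", ["William Shakespeare"]),
    ("capital of france", ["Paris"]),
]
test = [
    ("hamlet was written by whom", ["Shakespeare", "william shakespeare"]),
    ("largest planet in the solar system", ["Jupiter"]),
]

fraction = answer_overlap_fraction(train, test)  # 0.5 on this toy data
```

Splitting test accuracy by this overlap flag is one way to separate scores on repeated answers from scores on novel ones.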
Abstract
Ideally Open-Domain Question Answering models should exhibit a number of competencies, ranging from simply memorizing questions seen at training time, to answering novel question formulations with answers seen during training, to generalizing to completely novel questions with novel answers. However, single aggregated test set scores do not show the full picture of what capabilities models truly have. In this work, we perform a detailed study of the test sets of three popular open-domain benchmark datasets with respect to these competencies. We find that 60-70% of test-time answers are also present somewhere in the training sets. We also find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding training sets. Using these findings, we evaluate a variety of popular open-domain models to obtain greater insight into what extent they can actually generalize,…
Taxonomy
Methods: Linear Layer · Layer Normalization · Attention Is All You Need · Dropout · Adam · Multi-Head Attention · Softmax · Dense Connections · Residual Connection
