GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval
Timo Möller, Julian Risch, Malte Pietsch

TL;DR
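This paper introduces GermanQuAD, a German extractive question answering dataset of 13,722 question/answer pairs, and GermanDPR, a passage retrieval training dataset derived from it. An extractive QA model trained on GermanQuAD outperforms multilingual models, and the results show that machine-translated training data cannot fully substitute for hand-annotated data in the target language.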
Contribution
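The paper presents GermanQuAD, a hand-annotated German dataset for extractive QA, and GermanDPR, a dataset adapted from it for dense passage retrieval, which is used to train and evaluate the first non-English DPR model. It also summarizes lessons learned from the annotation process to support QA research on other languages.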
Findings
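An extractive QA model trained on GermanQuAD significantly outperforms multilingual models.
Machine-translated training data cannot fully substitute for hand-annotated training data in the target language.
GermanDPR, adapted from GermanQuAD, is used to train and evaluate the first non-English DPR model.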
Abstract
A major challenge of research on non-English machine reading for question answering (QA) is the lack of annotated datasets. In this paper, we present GermanQuAD, a dataset of 13,722 extractive question/answer pairs. To improve the reproducibility of the dataset creation approach and foster QA research on other languages, we summarize lessons learned and evaluate reformulation of question/answer pairs as a way to speed up the annotation process. An extractive QA model trained on GermanQuAD significantly outperforms multilingual models and also shows that machine-translated training data cannot fully substitute hand-annotated training data in the target language. Finally, we demonstrate the wide range of applications of GermanQuAD by adapting it to GermanDPR, a training dataset for dense passage retrieval (DPR), and train and evaluate the first non-English DPR model.
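To make the adaptation from extractive QA data to retrieval training data more concrete, the following is a minimal sketch of how one question/answer pair might be turned into a dense passage retrieval example (a question, a positive passage, and a few negative passages). The class names, field names, and the negative-selection rule are assumptions made for illustration; the abstract does not describe how GermanDPR actually selects negatives, so the filter below is only a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class QAPair:
    """One extractive QA annotation: the answer is a span of the context.

    Field names are hypothetical, loosely modeled on SQuAD-style data.
    """
    question: str
    context: str
    answer_text: str
    answer_start: int  # character offset of the answer inside the context


@dataclass
class DPRExample:
    """One retrieval training example (hypothetical layout)."""
    question: str
    positive_passage: str
    hard_negatives: list = field(default_factory=list)


def to_dpr_example(pair, other_passages, num_negatives=2):
    """Convert a QA pair into a retrieval example.

    Placeholder negative selection (an assumption, not the paper's method):
    take the first passages that differ from the context and do not
    contain the answer string.
    """
    end = pair.answer_start + len(pair.answer_text)
    if pair.context[pair.answer_start:end] != pair.answer_text:
        raise ValueError("answer span does not match the context")

    negatives = [
        passage
        for passage in other_passages
        if passage != pair.context and pair.answer_text not in passage
    ][:num_negatives]
    return DPRExample(pair.question, pair.context, negatives)


pair = QAPair(
    question="Was ist die Hauptstadt von Deutschland?",
    context="Berlin ist die Hauptstadt von Deutschland.",
    answer_text="Berlin",
    answer_start=0,
)
passages = [
    "Paris ist die Hauptstadt von Frankreich.",
    "Berlin hat rund 3,7 Millionen Einwohner.",  # contains the answer, skipped
    "Wien liegt an der Donau.",
]
example = to_dpr_example(pair, passages)
print(example.hard_negatives)
```

In this sketch the passage that contained the annotated answer becomes the positive passage, which is why an extractive QA dataset can be reused for retrieval without new annotation. A real pipeline would likely choose harder negatives than this simple string filter.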
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
