TL;DR
This paper studies how to build question answering systems for many languages from existing resources. It compares few-shot approaches augmented with automatic translations and with permutations of context-question-answer pairs, and it suggests how future dataset efforts can use a fixed annotation budget to increase language coverage.
Contribution
It provides an extensive analysis of few-shot, translation, and permutation methods for multilingual QA and offers recommendations for optimizing dataset creation within fixed annotation budgets.
Findings
Few-shot approaches with translations improve multilingual QA performance.
Permutation of context-question-answer pairs enhances data diversity.
Making better use of a fixed annotation budget can increase the language coverage of QA datasets and systems.
Abstract
Question answering (QA) in English has been widely explored, but multilingual datasets are relatively new, with several methods attempting to bridge the gap between high- and low-resourced languages using data augmentation through translation and cross-lingual transfer. In this project, we take a step back and study which approaches allow us to take the most advantage of existing resources in order to produce QA systems in many languages. Specifically, we perform extensive analysis to measure the efficacy of few-shot approaches augmented with automatic translations and permutations of context-question-answer pairs. In addition, we make suggestions for future dataset development efforts that make better use of a fixed annotation budget, with a goal of increasing the language coverage of QA datasets and systems. Code and data for reproducing our experiments are available here:…
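One way to picture the "permutations of context-question-answer pairs" mentioned in the abstract is sketched below in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each example is already available as parallel translations, and it pairs the context from one language with the question from another while keeping the answer in the context language so it stays an extractive span. The function name permute_languages, the dictionary layout, and the toy data are all hypothetical.

```python
from itertools import product


def permute_languages(parallel):
    """Build mixed-language QA examples from one parallel example.

    `parallel` maps a language code to a dict with "context", "question",
    and "answer" keys (hypothetical layout, assumed for illustration).
    """
    langs = sorted(parallel)
    examples = []
    # Pair every context language with every question language.
    for ctx_lang, q_lang in product(langs, repeat=2):
        examples.append({
            "context": parallel[ctx_lang]["context"],
            "question": parallel[q_lang]["question"],
            # Assumption: the answer follows the context language,
            # so it remains a substring of the context.
            "answer": parallel[ctx_lang]["answer"],
            "context_lang": ctx_lang,
            "question_lang": q_lang,
        })
    return examples


# Toy parallel example (made up for illustration, not from the paper's data).
toy = {
    "en": {"context": "Paris is the capital of France.",
           "question": "What is the capital of France?",
           "answer": "Paris"},
    "es": {"context": "París es la capital de Francia.",
           "question": "¿Cuál es la capital de Francia?",
           "answer": "París"},
    "de": {"context": "Paris ist die Hauptstadt von Frankreich.",
           "question": "Was ist die Hauptstadt von Frankreich?",
           "answer": "Paris"},
}

mixed = permute_languages(toy)
cross_lingual = [ex for ex in mixed if ex["context_lang"] != ex["question_lang"]]
print(len(mixed), len(cross_lingual))
```

With n parallel languages this yields n × n pairings per example, of which n are the original monolingual ones. Whether the paper permutes the answer language as well is not stated in the abstract, so that choice here is only an assumption.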
