TL;DR
This paper introduces a synthetic data augmentation method using question generation models to enhance zero-shot cross-lingual question answering, significantly improving performance across multiple multilingual datasets without additional annotated data.
Contribution
The paper presents a novel approach leveraging question generation models for synthetic data augmentation to boost cross-lingual QA performance without extra annotation.
Findings
Outperforms baselines trained only on English data
Achieves a new state of the art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr)
Demonstrates effectiveness of synthetic data in multilingual QA
Abstract
Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms the baselines trained on English data only. We report a new state of the art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).
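The augmentation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `QAExample` record, the `augment` function, and the `toy_generator` stand-in are all hypothetical names introduced here, and the sketch assumes that answer spans in the target-language passages have already been selected by some other means. In practice `generate_question` would wrap a trained question generation model; the abstract does not specify how.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class QAExample:
    """SQuAD-style extractive QA record (hypothetical schema for this sketch)."""
    context: str
    question: str
    answer: str
    answer_start: int
    lang: str
    synthetic: bool


def augment(
    english_data: Iterable[QAExample],
    passages: Iterable[Tuple[str, str, str]],
    generate_question: Callable[[str, str, str], str],
) -> List[QAExample]:
    """Mix annotated English data with synthetic target-language samples.

    `passages` yields (context, answer, lang) triples. `generate_question`
    is an assumed interface standing in for a question generation model.
    """
    synthetic = []
    for context, answer, lang in passages:
        start = context.find(answer)
        if start < 0:
            # Skip spans that cannot be located in the passage.
            continue
        question = generate_question(context, answer, lang).strip()
        if not question:
            # Skip empty generations.
            continue
        synthetic.append(QAExample(context, question, answer, start, lang, True))
    return list(english_data) + synthetic


def toy_generator(context: str, answer: str, lang: str) -> str:
    """Placeholder only: a real system would call a trained QG model here."""
    return f"[{lang}] Which part of the passage mentions '{answer}'?"


english = [
    QAExample("Paris is the capital of France.",
              "What is the capital of France?", "Paris", 0, "en", False)
]
passages = [
    ("Roma è la capitale d'Italia.", "Roma", "it"),
    ("Lyon est une ville française.", "Marseille", "fr"),  # span absent, skipped
]
mixed = augment(english, passages, toy_generator)
```

The resulting `mixed` list keeps the English annotations and adds one synthetic Italian sample; the French triple is dropped because its answer does not occur in the passage. A QA model would then be fine-tuned on this combined set, so no new human annotation is required.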
