Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

Arij Riabi; Thomas Scialom; Rachel Keraron; Benoît Sagot; Djamé Seddah; Jacopo Staiano

arXiv:2010.12643 · cs.CL · October 15, 2021

TL;DR

This paper introduces a synthetic data augmentation method using question generation models to enhance zero-shot cross-lingual question answering, significantly improving performance across multiple multilingual datasets without additional annotated data.

Contribution

The paper presents a novel approach leveraging question generation models for synthetic data augmentation to boost cross-lingual QA performance without extra annotation.

Findings

01. Outperforms baselines trained only on English data

02. Achieves a new state of the art on four multilingual datasets (MLQA, XQuAD, SQuAD-it and PIAF)

03. Demonstrates the effectiveness of synthetic data in multilingual QA

Abstract

Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms baselines trained on English data only. We report a new state of the art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).
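
The augmentation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `QAExample` layout, the `augment` and `build_training_set` helpers, and the `dummy_generator` stand-in are all hypothetical names introduced here, and a real system would replace the stand-in with a trained Question Generation model. The sketch only shows the data flow, where generated questions are paired with answer spans in target-language passages and the resulting synthetic samples are mixed with the annotated English data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class QAExample:
    """One extractive QA sample in a SQuAD-like layout (hypothetical schema)."""
    context: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer inside the context
    lang: str
    synthetic: bool = False


# Assumed signature for a question generator:
# (context, answer_text, target_lang) -> question string.
QuestionGenerator = Callable[[str, str, str], str]


def augment(
    passages: Iterable[Tuple[str, str, str]],  # (context, answer_text, lang)
    generate_question: QuestionGenerator,
) -> List[QAExample]:
    """Build synthetic QA samples by generating one question per answer span."""
    samples = []
    for context, answer_text, lang in passages:
        start = context.find(answer_text)
        if start < 0:
            # Extractive QA needs a character offset, so skip unlocatable spans.
            continue
        question = generate_question(context, answer_text, lang).strip()
        if not question:
            # Drop samples for which the generator produced nothing usable.
            continue
        samples.append(
            QAExample(context, question, answer_text, start, lang, synthetic=True)
        )
    return samples


def build_training_set(english_gold, synthetic):
    """Concatenate annotated English data with synthetic multilingual samples."""
    return list(english_gold) + list(synthetic)


def dummy_generator(context: str, answer: str, lang: str) -> str:
    # Stand-in for a trained QG model; returns a templated placeholder question.
    return f"[{lang}] Which span matches: {answer}?"


passages = [
    ("Paris est la capitale de la France.", "Paris", "fr"),
    ("Roma è la capitale d'Italia.", "Milano", "it"),  # answer absent, skipped
]
synthetic = augment(passages, dummy_generator)
gold = [QAExample("London is in England.", "Where is London?", "England", 13, "en")]
train = build_training_set(gold, synthetic)
```

Keeping a `synthetic` flag on each sample is a design choice of this sketch: it makes it easy to filter or reweight generated data later, which the abstract does not specify.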

Code & Models

Repositories

microsoft/unilm (PyTorch, Official)