Robustification of Multilingual Language Models to Real-world Noise in Crosslingual Zero-shot Settings with Robust Contrastive Pretraining

Asa Cooper Stickland; Sailik Sengupta; Jason Krone; Saab Mansour; He He

arXiv:2210.04782 · cs.CL · February 14, 2023 · 1 citation

TL;DR

This paper addresses the challenge of noise robustness in multilingual language models by constructing noisy datasets across five languages and four NLP tasks, and proposes a novel Robust Contrastive Pretraining method that significantly improves performance in noisy conditions.

Contribution

The paper introduces Robust Contrastive Pretraining (RCP), a new pretraining approach that enhances multilingual models' robustness to real-world noise across multiple languages and tasks.
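This listing does not spell out RCP's training objective, so the following is only a minimal sketch of the general idea of contrastive training on clean/noisy pairs, not the authors' implementation. It assumes a toy noise model (adjacent-letter swaps) and a generic InfoNCE-style loss over embedding vectors; the helper names `add_typo_noise`, `cosine`, and `info_nce`, the temperature value, and the hand-written vectors are all hypothetical.

```python
import math
import random


def add_typo_noise(text, rate, rng):
    """Swap adjacent letters with probability `rate` (a toy, assumed noise model)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so a letter is moved at most once
        else:
            i += 1
    return "".join(chars)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(clean, noisy, temperature=0.1):
    """Mean InfoNCE-style loss over a batch.

    clean[i] and noisy[i] are embeddings of the same sentence (positive pair);
    every other noisy[j] in the batch serves as a negative for clean[i].
    """
    total = 0.0
    for i, c in enumerate(clean):
        logits = [cosine(c, n) / temperature for n in noisy]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(clean)


if __name__ == "__main__":
    rng = random.Random(0)
    print(add_typo_noise("robust multilingual models", 0.2, rng))

    # Hand-written stand-ins for encoder outputs (assumed, not real model embeddings).
    clean = [[1.0, 0.0], [0.0, 1.0]]
    noisy = [[0.9, 0.1], [0.1, 0.9]]
    print(info_nce(clean, noisy))
```

In this sketch the loss is small when each clean vector is closest to its own noisy counterpart and large when the pairs are mismatched, which is the property a contrastive objective of this kind is meant to encourage. A real setup would obtain the vectors from a multilingual encoder and use language-specific noise rather than letter swaps.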

Findings

01. RCP improves performance on noisy data by 3.2% on sentence-level tasks.

02. RCP raises F1 by 10 points on sequence-labeling tasks.

03. Multilingual models' robustness varies significantly across languages and noise types.

Abstract

Advances in neural modeling have achieved state-of-the-art (SOTA) results on public natural language processing (NLP) benchmarks, at times surpassing human performance. However, there is a gap between public benchmarks and real-world applications where noise, such as typographical or grammatical mistakes, is abundant and can result in degraded performance. Unfortunately, work that evaluates the robustness of neural models on noisy data and proposes improvements is limited to the English language. Upon analyzing noise in different languages, we observe that noise types vary greatly across languages. Thus, existing investigations do not generalize trivially to multilingual settings. To benchmark the performance of pretrained multilingual language models, we construct noisy datasets covering five languages and four NLP tasks and observe a clear gap in the performance between clean and…

Code & Models

Repositories

amazon-science/multilingual-robust-contrastive-pretraining
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Test