Measuring the Robustness of NLP Models to Domain Shifts

Nitay Calderon; Naveh Porat; Eyal Ben-David; Alexander Chapanin; Zorik Gekhman; Nadav Oved; Vitaly Shalumov; Roi Reichart

arXiv:2306.00168 · cs.CL · April 23, 2024 · 1 citation

TL;DR

This paper introduces a comprehensive benchmark for measuring NLP models' robustness to domain shifts, emphasizing the importance of using both Source Drop and Target Drop metrics to better understand model degradation across diverse tasks and models.

Contribution

The authors curated a diverse NLP benchmark and conducted a large-scale study comparing fine-tuned models and few-shot LLMs, highlighting the significance of the Target Drop metric for evaluating robustness.

Findings

01. Few-shot LLMs often outperform fine-tuned models cross-domain.

02. A large Source Drop can be due to a shift to a harder domain rather than a true robustness issue.

03. Using both the Source Drop (SD) and the Target Drop (TD) provides a more complete picture of model robustness.

Abstract

Existing research on Domain Robustness (DR) suffers from disparate setups, limited task variety, and scarce research on recent capabilities such as in-context learning. Furthermore, the common practice of measuring DR might not be fully accurate. Current research focuses on challenge sets and relies solely on the Source Drop (SD): Using the source in-domain performance as a reference point for degradation. However, we argue that the Target Drop (TD), which measures degradation from the target in-domain performance, should be used as a complementary point of view. To address these issues, we first curated a DR benchmark comprised of 7 diverse NLP tasks, which enabled us to measure both the SD and the TD. We then conducted a comprehensive large-scale DR study involving over 14,000 domain shifts across 21 fine-tuned models and few-shot LLMs. We found that both model types suffer from drops…
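The two metrics described in the abstract can be illustrated with a minimal sketch. It assumes that a drop is a plain difference of scores: SD subtracts the cross-domain score from the source in-domain score, and TD subtracts it from the target in-domain score. The function names, the domain names, and every number below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Dict, Tuple

# Maps (training/demonstration domain, evaluation domain) to a score
# such as accuracy or F1. All values used below are made up.
Scores = Dict[Tuple[str, str], float]


def source_drop(scores: Scores, source: str, target: str) -> float:
    """Degradation relative to the source in-domain score (assumed form)."""
    return scores[(source, source)] - scores[(source, target)]


def target_drop(scores: Scores, source: str, target: str) -> float:
    """Degradation relative to the target in-domain score (assumed form)."""
    return scores[(target, target)] - scores[(source, target)]


# Hypothetical scores for two illustrative domains.
scores: Scores = {
    ("reviews", "reviews"): 90.0,  # source in-domain
    ("reviews", "tweets"): 74.0,   # cross-domain: source -> target
    ("tweets", "tweets"): 76.0,    # target in-domain
}

sd = source_drop(scores, "reviews", "tweets")
td = target_drop(scores, "reviews", "tweets")
print(f"SD = {sd:.1f}, TD = {td:.1f}")
```

In this made-up case the SD is large while the TD is small: most of the apparent degradation comes from the target domain being harder even for a model trained on it, which is the kind of situation the two metrics are meant to tell apart when read together.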

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques
