MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
Mădălina Zgreabăn, Tejaswini Deoskar, Lasha Abzianidze

TL;DR
This paper introduces MERGE, a method for automatically generating high-quality NLI problem variants by replacing open-class words, to evaluate models' robustness and generalization in natural language inference tasks.
Contribution
The paper presents a novel automated approach for creating reasoning-preserving NLI variants, enabling robust evaluation of model generalization without manual benchmark creation.
Findings
NLI models' performance drops 4-20% on variants
Performance is influenced by word class, probability, and plausibility
Models show low generalizability even on minimally altered problems
Abstract
In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how word class of the replacements,…
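The idea of testing a model on reasoning-preserving variants can be illustrated with a small Python sketch. This is not the authors' implementation: the helper names (`make_variants`, `variant_accuracy`), the hand-picked replacement words, and the toy predictor are all hypothetical, and the sketch simply assumes that replacing the same word consistently in premise and hypothesis leaves the gold label unchanged. How MERGE actually selects and filters replacements is not described in the text above.

```python
import re
from typing import Callable, Dict, List

Problem = Dict[str, str]  # keys: "premise", "hypothesis", "label"


def make_variants(problem: Problem, target: str, replacements: List[str]) -> List[Problem]:
    """Build variants by swapping `target` for each replacement in BOTH sentences.

    Assumption: a consistent swap of one open-class word keeps the gold label.
    """
    pattern = re.compile(rf"\b{re.escape(target)}\b")
    variants = []
    for word in replacements:
        variants.append({
            # A function replacement avoids any escape processing of `word`.
            "premise": pattern.sub(lambda m: word, problem["premise"]),
            "hypothesis": pattern.sub(lambda m: word, problem["hypothesis"]),
            "label": problem["label"],
        })
    return variants


def variant_accuracy(predict: Callable[[str, str], str], variants: List[Problem]) -> float:
    """Fraction of variants on which the predictor returns the gold label."""
    correct = sum(predict(v["premise"], v["hypothesis"]) == v["label"] for v in variants)
    return correct / len(variants)


def toy_predict(premise: str, hypothesis: str) -> str:
    """Stand-in for an NLI model; deliberately brittle on one word (illustration only)."""
    if "flute" in premise:
        return "neutral"
    return "entailment"


original = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A man is playing a guitar.",
    "label": "entailment",
}
variants = make_variants(original, "guitar", ["violin", "piano", "flute"])
score = variant_accuracy(toy_predict, variants)
```

Here the toy predictor gets the original problem right but fails on one of the three variants, which is the kind of gap between original and variant performance that the abstract reports for real NLI models.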
