MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

Mădălina Zgreabăn; Tejaswini Deoskar; Lasha Abzianidze

arXiv:2510.24295·cs.CL·October 29, 2025

TL;DR

This paper introduces MERGE, a method for automatically generating high-quality NLI problem variants by replacing open-class words, to evaluate models' robustness and generalization in natural language inference tasks.

Contribution

The paper presents a novel automated approach for creating reasoning-preserving NLI variants, enabling robust evaluation of model generalization without manual benchmark creation.

Findings

01

NLI models' performance drops 4-20% on variants

02

Performance is influenced by word class, probability, and plausibility

03

Models show low generalizability even on minimally altered problems

Abstract

In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how the word class of the replacements,…
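
To make the idea of reasoning-preserving variants concrete, here is a minimal sketch, not the authors' pipeline. It assumes the replacement words are supplied by hand, that the same open-class word is swapped in both premise and hypothesis, and that a problem counts as solved only if every variant receives the gold label. The names `make_variants` and `consistent_accuracy`, and that all-variants scoring rule, are illustrative assumptions; the abstract does not specify how replacements are chosen or how scores are aggregated.

```python
import re
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (premise, hypothesis)


def make_variants(premise: str, hypothesis: str,
                  target: str, replacements: List[str]) -> List[Pair]:
    """Swap every whole-word occurrence of `target` in both sentences.

    Assumption: replacing the same word on both sides keeps the
    inference relation intact (hand-picked replacements only).
    """
    pattern = re.compile(rf"\b{re.escape(target)}\b")
    return [
        (pattern.sub(lambda _m, w=w: w, premise),
         pattern.sub(lambda _m, w=w: w, hypothesis))
        for w in replacements
    ]


def consistent_accuracy(predict: Callable[[str, str], str],
                        items: List[Tuple[List[Pair], str]]) -> float:
    """Fraction of problems whose variants are ALL labeled correctly.

    Assumption: an all-or-nothing rule per problem; other aggregation
    rules (e.g. a threshold over variants) are equally plausible.
    """
    if not items:
        return 0.0
    solved = sum(
        all(predict(p, h) == gold for p, h in variants)
        for variants, gold in items
    )
    return solved / len(items)


# Hypothetical example problem and hand-picked replacement words.
variants = make_variants(
    "A man is playing a guitar.",
    "A man is playing an instrument.",
    target="man",
    replacements=["woman", "child"],
)
```

Under this scoring rule, a model that labels the "woman" variant correctly but fails on the "child" variant gets no credit for the problem, which is one way to expose the kind of brittleness the abstract reports.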
