Stress Test Evaluation for Natural Language Inference
Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, Graham Neubig

TL;DR
This paper introduces stress tests to evaluate whether natural language inference models truly understand semantic content, revealing their strengths and weaknesses across challenging linguistic phenomena.
Contribution
It proposes a novel stress test methodology for assessing the inferential capabilities of NLI models beyond standard datasets.
Findings
Models show varying performance on linguistic phenomena
Stress tests reveal specific weaknesses in models
Results suggest directions for improving NLI systems
Abstract
Natural language inference (NLI) is the task of determining if a natural language hypothesis can be inferred from a given premise in a justifiable manner. NLI was proposed as a benchmark task for natural language understanding. Existing models perform well at standard datasets for NLI, achieving impressive results across different genres of text. However, the extent to which these models understand the semantic content of sentences is unclear. In this work, we propose an evaluation methodology consisting of automatically constructed "stress tests" that allow us to examine whether systems have the ability to make real inferential decisions. Our evaluation of six sentence-encoder models on these stress tests reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena, and suggests important directions for future work in this area.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
