Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark

Guanhua Zhang; Bing Bai; Jian Liang; Kun Bai; Conghui Zhu; Tiejun Zhao

arXiv:2010.07676 · cs.CL · October 16, 2020 · 1 citation

TL;DR

This paper introduces a unified cross-dataset benchmark for Natural Language Inference to evaluate model generalization and debiasing effectiveness, addressing biases in existing datasets and improving evaluation reliability.

Contribution

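The paper proposes a new unified cross-dataset benchmark covering 14 NLI datasets, and uses it to re-evaluate 9 widely used neural NLI models and 5 recently proposed debiasing methods for annotation artifacts, giving a more trustworthy performance assessment.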

Findings

01. Models perform worse when evaluated across datasets than on their in-domain test sets.

02. The effectiveness of debiasing methods varies across datasets.

03. The benchmark provides a more reliable evaluation framework for NLI.

Abstract

Recent studies show that crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts. Models utilizing these superficial clues gain mirage advantages on the in-domain test set, which leads to over-estimated evaluation results. The lack of trustworthy evaluation settings and benchmarks stalls the progress of NLI research. In this paper, we propose to assess a model's trustworthy generalization performance with cross-dataset evaluation. We present a new unified cross-dataset benchmark with 14 NLI datasets, and re-evaluate 9 widely used neural network-based NLI models as well as 5 recently proposed debiasing methods for annotation artifacts. Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
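
The idea of cross-dataset evaluation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two toy datasets, the function names (accuracy, cross_dataset_report), and the hypothesis-only predictor that stands in for a model exploiting an annotation artifact (here, the word "not" signalling contradiction). It only shows why a single in-domain score can overstate how well a model generalizes.

```python
from typing import Callable, Dict, List, Tuple

# One NLI example: (premise, hypothesis, gold label).
Example = Tuple[str, str, str]
Predictor = Callable[[str, str], str]


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("cannot score an empty dataset")
    correct = sum(predict(p, h) == gold for p, h, gold in examples)
    return correct / len(examples)


def cross_dataset_report(
    predict: Predictor, datasets: Dict[str, List[Example]]
) -> Dict[str, float]:
    """Score one fixed predictor on every dataset, not only its in-domain one."""
    return {name: accuracy(predict, data) for name, data in datasets.items()}


def hypothesis_only_predict(premise: str, hypothesis: str) -> str:
    """Hypothetical biased model: ignores the premise and keys on the word 'not'."""
    has_negation = "not" in hypothesis.lower().split()
    return "contradiction" if has_negation else "entailment"


# Toy data invented for this sketch. In the first set the artifact happens to
# agree with the gold labels; in the second set it does not.
TOY_DATASETS: Dict[str, List[Example]] = {
    "in_domain_toy": [
        ("A dog runs.", "The dog is not moving.", "contradiction"),
        ("A dog runs.", "An animal moves.", "entailment"),
    ],
    "out_of_domain_toy": [
        ("A man is sleeping.", "The man is not awake.", "entailment"),
        ("A child eats.", "A child is eating.", "entailment"),
    ],
}

report = cross_dataset_report(hypothesis_only_predict, TOY_DATASETS)
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

On this toy data the artifact-driven predictor is perfect in-domain but only half right on the second set, which is the kind of gap a cross-dataset benchmark is meant to expose. A real setup would replace the toy predictor with a trained model and the toy sets with actual NLI datasets mapped to a shared label set.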

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques