Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment

Masatoshi Tsuchiya

arXiv:1804.08117·cs.CL·April 24, 2018·30 cites

Open Access

TL;DR

This paper introduces a method, inspired by statistical hypothesis testing, for detecting hidden bias in training data for recognizing textual entailment (RTE). It finds that the SNLI corpus contains a bias that allows labels to be predicted from hypothesis sentences alone, which affects model performance.

Contribution

It proposes a two-phase approach, inspired by statistical hypothesis testing, for identifying hidden bias in RTE corpora, and highlights the influence of such bias on neural network model performance.

Findings

01

The SNLI corpus exhibits a hidden bias that enables label prediction from hypothesis sentences alone, without context information

02

The proposed method effectively detects biases in large RTE datasets

03

Hidden biases significantly impact neural network model performance in RTE

Abstract

The quality of training data is one of the crucial problems when a learning-centered approach is employed. This paper proposes a new method to investigate the quality of a large corpus designed for the recognizing textual entailment (RTE) task. The proposed method, which is inspired by a statistical hypothesis test, consists of two phases: the first phase is to introduce the predictability of textual entailment labels as a null hypothesis which is extremely unacceptable if a target corpus has no hidden bias, and the second phase is to test the null hypothesis using a Naive Bayes model. The experimental result of the Stanford Natural Language Inference (SNLI) corpus does not reject the null hypothesis. Therefore, it indicates that the SNLI corpus has a hidden bias which allows prediction of textual entailment labels from hypothesis sentences even if no context information is given by a…
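The second phase described in the abstract, predicting entailment labels from hypothesis sentences with a Naive Bayes model, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the function names (`train_nb`, `predict_nb`, `accuracy`, `majority_baseline`), the whitespace tokenizer, the add-one smoothing, and above all the toy sentences, which are invented and are not drawn from SNLI. The point is only that the classifier never sees a premise, so any accuracy above a majority-class baseline has to come from regularities in the hypothesis sentences themselves.

```python
import math
from collections import Counter, defaultdict

# Invented toy data (NOT from SNLI): hypothesis sentences only, no premises.
TOY_TRAIN = [
    ("a person is outdoors", "entailment"),
    ("a person is moving", "entailment"),
    ("people are outside", "entailment"),
    ("nobody is outside", "contradiction"),
    ("the man is sleeping", "contradiction"),
    ("nobody is moving", "contradiction"),
    ("the man is waiting for a friend", "neutral"),
    ("a tall person is winning", "neutral"),
    ("the woman is waiting for a bus", "neutral"),
]
TOY_TEST = [
    ("a person is outside", "entailment"),
    ("nobody is sleeping", "contradiction"),
    ("a tall man is waiting", "neutral"),
]


def train_nb(examples, alpha=1.0):
    """Fit a multinomial Naive Bayes model on (hypothesis, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in text.lower().split():  # assumed tokenizer: whitespace split
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"labels": label_counts, "words": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, text):
    """Return the label with the highest smoothed log-posterior."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label in sorted(model["labels"]):
        score = math.log(model["labels"][label] / total)  # log prior
        denom = sum(model["words"][label].values()) + alpha * vocab_size
        for tok in text.lower().split():
            if tok in model["vocab"]:  # unseen tokens are skipped
                score += math.log((model["words"][label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose label is predicted correctly."""
    return sum(predict_nb(model, t) == y for t, y in examples) / len(examples)


def majority_baseline(train, test):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(y == majority for _, y in test) / len(test)


model = train_nb(TOY_TRAIN)
print("hypothesis-only accuracy:", accuracy(model, TOY_TEST))
print("majority-class baseline:", majority_baseline(TOY_TRAIN, TOY_TEST))
```

On a real corpus the same comparison would be run on the official train and test splits, and the gap between the two numbers would be read as evidence of bias in the hypothesis sentences. The toy data here is constructed to show such a gap and says nothing about the size of the effect in SNLI.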

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies