Addressing Limited Data for Textual Entailment Across Domains

Chaitanya Shivade; Preethi Raghavan; Siddharth Patwardhan

arXiv:1606.02638 · cs.CL · June 9, 2016

TL;DR

This paper develops a new clinical entailment dataset and employs self-training and active learning to improve textual entailment performance across domains with limited labeled data.

Contribution

The paper introduces a clinical entailment dataset and a supervised entailment system, ENT, and shows that self-training and active learning can compensate for limited labeled data in textual entailment.

Findings

01

Self-training on unlabeled data improves over ENT by 15% F-score on the newswire domain and 13% F-score on clinical data.

02

Active learning matches (and even beats) ENT using only 6.6% of the training data in the clinical domain and only 5.8% in the newswire domain.

03

The supervised ENT system is highly competitive and effective out of the box on two domains.

Abstract

We seek to address the lack of labeled data (and high cost of annotation) for textual entailment in some domains. To that end, we first create (for experimental purposes) an entailment dataset for the clinical domain, and a highly competitive supervised entailment system, ENT, that is effective (out of the box) on two domains. We then explore self-training and active learning strategies to address the lack of labeled data. With self-training, we successfully exploit unlabeled data to improve over ENT by 15% F-score on the newswire domain, and 13% F-score on clinical data. On the other hand, our active learning experiments demonstrate that we can match (and even beat) ENT using only 6.6% of the training data in the clinical domain, and only 5.8% of the training data in the newswire domain.
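
The two strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' ENT system or their exact procedure: the nearest-centroid classifier, the two-dimensional feature vectors, the margin-based confidence score, and the names `fit_centroids`, `predict`, `self_train`, and `active_learn` are all assumptions chosen only to keep the example short and dependency-free. Self-training adds the model's own confident predictions to the training set; active learning instead asks an annotator (here a hypothetical `oracle` function) to label the examples the model is least sure about.

```python
from statistics import mean


def fit_centroids(examples):
    """Fit a toy nearest-centroid model from (feature_vector, label) pairs."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*xs)] for y, xs in by_label.items()}


def predict(centroids, x):
    """Return (label, margin): margin is the distance gap between the
    nearest and second-nearest centroid, used here as a confidence proxy."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, y)
        for y, c in centroids.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_train(labeled, unlabeled, threshold=1.0, rounds=5):
    """Self-training: repeatedly pseudo-label confident unlabeled examples."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = fit_centroids(labeled)
        scored = [(x, *predict(model, x)) for x in pool]
        confident = [(x, label) for x, label, m in scored if m >= threshold]
        if not confident:
            break  # nothing left that the model is sure about
        labeled.extend(confident)
        pool = [x for x, _, m in scored if m < threshold]
    return fit_centroids(labeled), labeled


def active_learn(labeled, unlabeled, oracle, budget=3):
    """Active learning by uncertainty sampling: query the oracle for the
    label of the lowest-margin example, up to `budget` queries."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(min(budget, len(pool))):
        model = fit_centroids(labeled)
        idx = min(range(len(pool)), key=lambda i: predict(model, pool[i])[1])
        x = pool.pop(idx)
        labeled.append((x, oracle(x)))  # the human annotation step
    return fit_centroids(labeled), labeled


if __name__ == "__main__":
    # Hypothetical features for premise/hypothesis pairs.
    seed = [([0.0, 0.0], "not_entailed"), ([4.0, 4.0], "entailed")]
    pool = [[0.5, 0.2], [3.8, 4.1], [2.0, 2.1]]
    _, grown = self_train(seed, pool)
    print(len(grown), "training examples after self-training")
    _, queried = active_learn(seed, pool, oracle=lambda x: "entailed", budget=1)
    print("example sent to the annotator:", queried[-1][0])
```

In this toy setup, self-training absorbs the two pool items that lie close to a class centroid and leaves the ambiguous one unlabeled, while active learning spends its single annotation on that ambiguous item. The contrast mirrors the trade-off in the abstract: one strategy exploits unlabeled data at no annotation cost, the other reduces how much labeled data is needed.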
