MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Jiaao Chen; Zichao Yang; Diyi Yang

arXiv:2004.12239 · cs.CL · April 28, 2020 · 24 cites

TL;DR

MixText introduces a semi-supervised text classification method that leverages hidden space interpolation and data augmentation to improve performance, especially with limited labeled data.

Contribution

The paper proposes TMix, a novel data augmentation technique in hidden space, and demonstrates how mixing labeled, unlabeled, and augmented data enhances semi-supervised learning.
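As a rough illustration of what interpolating in hidden space can look like, the sketch below mixes two hidden vectors and their one-hot labels with a shared weight. The function names (`mix_hidden`, `tmix_pair`), the Beta-sampled weight, the `alpha` value, and the choice to keep the weight at or above 0.5 are assumptions borrowed from the general mixup recipe, not details stated on this page. In a real model the vectors would come from an intermediate encoder layer rather than hand-written lists.

```python
import random


def mix_hidden(h_a, h_b, lam):
    """Linearly interpolate two equal-length vectors with weight lam."""
    if len(h_a) != len(h_b):
        raise ValueError("vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]


def tmix_pair(h_a, y_a, h_b, y_b, alpha=0.75, rng=random):
    """Mix two (hidden vector, label distribution) pairs.

    alpha and the Beta(alpha, alpha) sampling are illustrative
    assumptions; rng may be the random module or a random.Random.
    """
    lam = rng.betavariate(alpha, alpha)
    # Assumption: keep the mixed sample closer to the first input.
    lam = max(lam, 1.0 - lam)
    mixed_h = mix_hidden(h_a, h_b, lam)
    # The labels are mixed with the same weight as the hidden states.
    mixed_y = mix_hidden(y_a, y_b, lam)
    return mixed_h, mixed_y, lam


if __name__ == "__main__":
    h, y, lam = tmix_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
    print(lam, h, y)
```

Because the same weight is applied to the hidden vectors and the labels, the mixed label remains a valid probability distribution whenever both input labels are.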

Findings

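01 Outperforms current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks.

02 The improvement is largest when supervision is extremely limited.

03 Code is publicly released at https://github.com/GT-SALT/MixText.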

Abstract

This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled and augmented data, MixText significantly outperformed current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks. The improvement is especially prominent when supervision is extremely limited. We have publicly released our code at https://github.com/GT-SALT/MixText.
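One common way to obtain low-entropy guessed labels is to average the model's predictions over several augmented copies of an unlabeled example and then sharpen the average with a temperature. The sketch below shows that idea only. The helper names, the plain unweighted average, and the temperature value are assumptions for illustration and are not taken from the abstract.

```python
import math


def average_predictions(prob_lists):
    """Average several class-probability lists element-wise."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]


def sharpen(probs, temperature=0.5):
    """Raise each probability to 1/temperature and renormalize.

    A temperature below 1 pushes mass toward the largest class,
    which lowers the entropy of the distribution.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def guess_label(prob_lists, temperature=0.5):
    """Hypothetical helper: average the predictions, then sharpen."""
    return sharpen(average_predictions(prob_lists), temperature)


if __name__ == "__main__":
    # Predictions for one unlabeled text and one augmented copy.
    preds = [[0.6, 0.4], [0.8, 0.2]]
    avg = average_predictions(preds)
    guess = guess_label(preds)
    print(avg, entropy(avg))
    print(guess, entropy(guess))
```

The guessed distribution can then be treated as a soft label, which is what lets unlabeled examples be mixed in the same way as labeled ones.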

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: MixText