Transformer to CNN: Label-scarce distillation for efficient text classification

Yew Ken Chia; Sam Witteveen; Martin Andrews

arXiv:1909.03508 · cs.LG · September 10, 2019 · 21 cites

Open Access

TL;DR

This paper introduces a convolutional student model trained via distillation from a large NLP model, achieving a 300x inference speedup and a 39x reduction in parameter count on text classification tasks while, in some cases, surpassing its teacher.

Contribution

It presents a novel CNN-based student architecture trained through distillation to efficiently perform text classification with limited labeled data.

Findings

01

300x inference speedup

02

39x reduction in parameters

03

Student surpasses teacher in some tasks

Abstract

Significant advances have been made in Natural Language Processing (NLP) modelling since the beginning of 2018. The new approaches allow for accurate results, even when there is little labelled data, because these NLP models can benefit from training on both task-agnostic and task-specific unlabelled data. However, these advantages come with significant size and computational costs. This workshop paper outlines how our proposed convolutional student architecture, having been trained by a distillation process from a large-scale model, can achieve 300x inference speedup and 39x reduction in parameter count. In some cases, the student model performance surpasses its teacher on the studied tasks.
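
The distillation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss used, so the two objectives below (a temperature-softened KL divergence and a mean squared error on logits) are common formulations assumed here for illustration, not the paper's confirmed method. The function names `softmax`, `distillation_loss`, and `logit_mse`, and the default temperature of 2.0, are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened class distributions.

    Assumption: the result is scaled by temperature**2, a common convention
    that keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def logit_mse(student_logits, teacher_logits):
    """Alternative objective (also an assumption): mean squared error on raw logits."""
    squared = [(s - t) ** 2 for s, t in zip(student_logits, teacher_logits)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # hypothetical teacher logits for one example
    student = [1.5, 0.7, -0.8]  # hypothetical student logits for the same example
    print(distillation_loss(student, teacher))
    print(logit_mse(student, teacher))
```

Neither objective needs a gold label, only the teacher's output for each input, which is why a student can be trained on unlabelled text when labelled data is scarce.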

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications