Transformer to CNN: Label-scarce distillation for efficient text classification
Yew Ken Chia, Sam Witteveen, Martin Andrews

TL;DR
This paper introduces a convolutional student model trained via distillation from a large-scale NLP model, achieving a 300x inference speedup and a 39x reduction in parameter count, with the student surpassing its teacher on some of the studied text classification tasks.
Contribution
It presents a novel CNN-based student architecture trained through distillation to efficiently perform text classification with limited labeled data.
Findings
300x inference speedup
39x reduction in parameters
Student surpasses teacher on some tasks
Abstract
Significant advances have been made in Natural Language Processing (NLP) modelling since the beginning of 2018. The new approaches allow for accurate results, even when there is little labelled data, because these NLP models can benefit from training on both task-agnostic and task-specific unlabelled data. However, these advantages come with significant size and computational costs. This workshop paper outlines how our proposed convolutional student architecture, having been trained by a distillation process from a large-scale model, can achieve 300x inference speedup and 39x reduction in parameter count. In some cases, the student model performance surpasses its teacher on the studied tasks.
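The abstract does not spell out the distillation objective, so the following is only a minimal sketch of one common formulation (temperature-scaled soft-target matching), not necessarily the loss used by the authors. The function names `log_softmax` and `distillation_loss`, and the `temperature` parameter, are illustrative assumptions. The idea is that the student is trained to reproduce the teacher's output distribution, which needs no gold labels and so suits the label-scarce setting.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum_exp for z in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions (assumed objective).

    Multiplying by temperature**2 keeps gradient magnitudes comparable
    across temperatures, a convention from soft-target distillation.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2

# Hypothetical logits for a 3-class task: the loss shrinks as the
# student's logits move toward the teacher's.
teacher = [3.0, 0.5, -1.0]
far_student = [0.0, 0.0, 0.0]
near_student = [2.8, 0.6, -0.9]
print(distillation_loss(far_student, teacher))
print(distillation_loss(near_student, teacher))
```

In practice the teacher logits would be computed once over unlabelled task text, and the student (here, the convolutional model) would be optimised against them with a deep learning framework rather than this pure-Python arithmetic.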
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
