Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models

Qinyuan Ye; Madian Khabsa; Mike Lewis; Sinong Wang; Xiang Ren; Aaron Jaech

arXiv:2110.08536·cs.CL·July 26, 2022

TL;DR

This paper introduces a method to create larger, sparser student models for text classification that retain high accuracy while achieving significant inference speed improvements, suitable for real-time applications.

Contribution

It proposes distilling teacher models into large, sparse student models whose parameters are mostly n-gram embeddings, substantially improving inference speed while retaining most of the teacher's accuracy.

Findings

01. Retain 97% of teacher model performance on average

02. Achieve up to 600x inference speed-up on GPUs and CPUs

03. Effective for sentence-pair classification and domain generalization

Abstract

Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. The student models are typically compact transformers with fewer parameters, while expensive operations such as self-attention persist. Therefore, the improved inference speed may still be unsatisfactory for real-time or high-volume use cases. In this paper, we aim to further push the limit of inference speed by distilling teacher models into bigger, sparser student models -- bigger in that they scale up to billions of parameters; sparser in that most of the model parameters are n-gram embeddings. Our experiments on six single-sentence text classification tasks show that these student models retain 97% of the RoBERTa-Large teacher performance on average, and meanwhile achieve up to 600x speed-up on both GPUs and CPUs at inference time. Further…
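To make the idea of a student whose parameters are mostly n-gram embeddings more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the names `extract_ngrams`, `NgramStudent`, and `kl_divergence` are hypothetical, the hashing of n-grams into buckets is an assumption made to keep the example small, and KL divergence against the teacher's soft labels is just one common choice of distillation loss. The point it shows is that a forward pass only reads the embedding rows for n-grams present in the input, so inference cost depends on input length rather than on the size of the embedding table.

```python
import math
import random
import zlib


def extract_ngrams(tokens, max_n=2):
    """Return all contiguous n-grams of length 1..max_n as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student); assumed here as the distillation loss."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)


class NgramStudent:
    """Hypothetical student: average n-gram embeddings, then a linear classifier."""

    def __init__(self, num_buckets=1000, dim=8, num_classes=2, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.dim = dim
        # The embedding table holds almost all parameters (num_buckets * dim).
        self.embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                           for _ in range(num_buckets)]
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(num_classes)]
        self.bias = [0.0] * num_classes

    def bucket(self, ngram):
        # Assumption: hash n-grams into a fixed number of rows.
        return zlib.crc32(ngram.encode("utf-8")) % self.num_buckets

    def predict(self, tokens, max_n=2):
        ids = [self.bucket(g) for g in extract_ngrams(tokens, max_n)]
        if ids:
            # Sparse lookup: only rows for n-grams in this input are read.
            pooled = [sum(self.embeddings[i][d] for i in ids) / len(ids)
                      for d in range(self.dim)]
        else:
            pooled = [0.0] * self.dim
        logits = [sum(w * x for w, x in zip(row, pooled)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(logits)


student = NgramStudent()
probs = student.predict("a surprisingly good movie".split())
# The teacher distribution below is made up for illustration.
loss = kl_divergence([0.9, 0.1], probs)
```

In a real distillation setup the teacher probabilities would come from the large transformer's predictions on unlabeled or training text, and the student's embeddings and classifier weights would be updated to reduce this loss; the training loop is omitted here.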

Code & Models

Repositories

ink-usc/sparse-distillation (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings