Structured Pruning of a BERT-based Question Answering Model
J.S. McCarley, Rishav Chakravarti, and Avirup Sil

TL;DR
This paper explores structured pruning of BERT-based question answering models combined with task-specific distillation, achieving faster inference with minimal accuracy loss without pretraining distillation.
Contribution
It introduces an efficient method for compressing question answering models through structured pruning and distillation, applicable to BERT and RoBERTa, improving speed with minimal accuracy impact.
Findings
Near-doubled inference speed
Less than 0.5 F1-point accuracy loss
Effective across multiple datasets
Abstract
The recent trend in industry-setting Natural Language Processing (NLP) research has been to operate large %scale pretrained language models like BERT under strict computational limits. While most model compression work has focused on "distilling" a general-purpose language representation using expensive pretraining distillation, less attention has been paid to creating smaller task-specific language representations which, arguably, are more useful in an industry setting. In this paper, we investigate compressing BERT- and RoBERTa-based question answering systems by structured pruning of parameters from the underlying transformer model. We find that an inexpensive combination of task-specific structured pruning and task-specific distillation, without the expense of pretraining distillation, yields highly-performing models across a range of speed/accuracy tradeoff operating points. We…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
MethodsPruning · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding
