Structured Pruning of a BERT-based Question Answering Model

J.S. McCarley, Rishav Chakravarti, and Avirup Sil

arXiv:1910.06360 · cs.CL · April 13, 2021 · 72 cites

TL;DR

This paper explores structured pruning of BERT-based question answering models combined with task-specific distillation, achieving faster inference with minimal accuracy loss without pretraining distillation.

Contribution

The paper introduces an inexpensive method for compressing question answering models through task-specific structured pruning and task-specific distillation. The method applies to both BERT and RoBERTa and nearly doubles inference speed at a cost of less than 0.5 F1 points.

Findings

01. Near-doubled inference speed

02. Less than 0.5 F1-point accuracy loss

03. Effective across multiple datasets

Abstract

The recent trend in industry-setting Natural Language Processing (NLP) research has been to operate large-scale pretrained language models like BERT under strict computational limits. While most model compression work has focused on "distilling" a general-purpose language representation using expensive pretraining distillation, less attention has been paid to creating smaller task-specific language representations which, arguably, are more useful in an industry setting. In this paper, we investigate compressing BERT- and RoBERTa-based question answering systems by structured pruning of parameters from the underlying transformer model. We find that an inexpensive combination of task-specific structured pruning and task-specific distillation, without the expense of pretraining distillation, yields highly performing models across a range of speed/accuracy tradeoff operating points. We…
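
To make "structured pruning" concrete, the sketch below removes whole attention heads from a toy projection matrix, so the result is a smaller dense matrix rather than a sparse one. It is a minimal illustration and not the paper's procedure: the importance score (the L2 norm of each head's block) is a stand-in chosen for simplicity, and the names `head_scores` and `prune_heads`, the row-per-head layout, and the toy weights are all assumptions made for this example.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def head_scores(w: Matrix, head_dim: int) -> List[float]:
    """Score each head by the L2 norm of its block of rows.

    Assumption: this magnitude score is only an illustrative proxy for
    head importance; it is not the criterion used in the paper.
    """
    num_heads = len(w) // head_dim
    scores = []
    for h in range(num_heads):
        block = w[h * head_dim:(h + 1) * head_dim]
        scores.append(math.sqrt(sum(x * x for row in block for x in row)))
    return scores


def prune_heads(w: Matrix, head_dim: int, keep: int) -> Tuple[Matrix, List[int]]:
    """Keep the `keep` highest-scoring heads and drop the others as whole blocks."""
    scores = head_scores(w, head_dim)
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original head order
    pruned = [row for h in kept for row in w[h * head_dim:(h + 1) * head_dim]]
    return pruned, kept


# Toy projection (hypothetical values): 4 heads, head_dim = 2, hidden size = 3.
w = [
    [0.9, -0.8, 0.7], [0.6, 0.5, -0.9],     # head 0
    [0.01, 0.0, -0.02], [0.0, 0.01, 0.0],   # head 1
    [0.4, 0.3, -0.5], [-0.2, 0.6, 0.1],     # head 2
    [0.02, -0.01, 0.0], [0.0, 0.0, 0.03],   # head 3
]

pruned_w, kept_heads = prune_heads(w, head_dim=2, keep=2)
print("kept heads:", kept_heads)
print("shape before:", (len(w), len(w[0])), "after:", (len(pruned_w), len(pruned_w[0])))
```

Because entire heads are dropped, the pruned matrix has half as many rows and can be multiplied with ordinary dense kernels, which is where the speedup of structured (as opposed to unstructured) pruning comes from. In a real transformer the same head indices would also have to be removed from the query, key, value, and output projections of that layer.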

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Pruning · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding