Distilling Neural Networks for Greener and Faster Dependency Parsing
Mark Anderson, Carlos Gómez-Rodríguez

TL;DR
This paper demonstrates that teacher-student distillation can significantly reduce the size and increase the speed of neural dependency parsers with minimal accuracy loss, leading to greener and more efficient NLP models.
Contribution
It introduces a distillation approach to compress a state-of-the-art dependency parser, achieving comparable accuracy with much smaller models and faster inference times.
Findings
20% model size retains ~99% accuracy
2.30x faster inference on CPU
Outperforms fastest modern parser on Penn Treebank
Abstract
The carbon footprint of natural language processing research has been increasing in recent years due to its reliance on large and inefficient neural network implementations. Distillation is a network compression technique which attempts to impart knowledge from a large model to a smaller one. We use teacher-student distillation to improve the efficiency of the Biaffine dependency parser which obtains state-of-the-art performance with respect to accuracy and parsing speed (Dozat and Manning, 2017). When distilling to 20% of the original model's trainable parameters, we only observe an average decrease of 1 point for both UAS and LAS across a number of diverse Universal Dependency treebanks while being 2.30x (1.19x) faster than the baseline model on CPU (GPU) at inference time. We also observe a small increase in performance when compressing to 80% for some treebanks. Finally,…
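The abstract does not spell out the training objective, so the following is only a minimal sketch of a common teacher-student distillation loss, not the authors' implementation. It assumes the student is trained on a weighted mix of (a) cross-entropy against the teacher's softened output distribution and (b) cross-entropy against the gold label, for a single token choosing among candidate heads. The names `distillation_loss`, `temperature`, and `alpha` are hypothetical and chosen for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy of the predicted distribution against a target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def distillation_loss(student_scores, teacher_scores, gold_index,
                      temperature=1.0, alpha=0.5):
    """Assumed objective: alpha * soft (teacher) loss + (1 - alpha) * hard (gold) loss."""
    # Soft term: the student imitates the teacher's full distribution.
    teacher_probs = softmax(teacher_scores, temperature)
    student_soft = softmax(student_scores, temperature)
    soft_loss = cross_entropy(teacher_probs, student_soft)

    # Hard term: the student is also scored against the gold annotation.
    student_probs = softmax(student_scores)
    hard_loss = -math.log(student_probs[gold_index] + 1e-12)

    return alpha * soft_loss + (1 - alpha) * hard_loss


# Hypothetical head scores for one token over three candidate heads.
teacher = [2.0, 0.5, -1.0]
agreeing_student = [2.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 2.0]

loss_agree = distillation_loss(agreeing_student, teacher, gold_index=0)
loss_disagree = distillation_loss(disagreeing_student, teacher, gold_index=0)
```

In this sketch a student whose scores match the teacher's receives a lower loss than one that ranks the candidate heads in the opposite order, which is the signal that lets a smaller model absorb the larger model's behaviour.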
