Efficient Transformer Knowledge Distillation: A Performance Review

Nathan Brown; Ashton Williamson; Tahj Anderson; Logan Lawrence

arXiv:2311.13657 · cs.CL · November 27, 2023 · 1 citation

Open Access

TL;DR

This paper evaluates knowledge distillation on efficient attention transformer models across several NLP tasks, including a new long-context NER dataset. Distilled models retain up to 98.6% of original performance on short-context tasks while cutting inference times by up to 57.8%.

Contribution

It provides a comprehensive performance review of knowledge distillation applied to efficient attention transformers and introduces the GONERD dataset for long-context NER evaluation.

Findings

01. Distilled efficient transformers retain up to 98.6% of original model performance on short-context tasks.

02. They retain up to 94.6% of original performance on long-context QA tasks.

03. Distillation decreases inference times by up to 57.8%.
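
The figures above are relative measures, comparing a distilled student model with its original teacher. This page does not say how they were computed, so the following is only a minimal sketch under one assumption: retention is the student's score as a percentage of the teacher's, and the inference-time decrease is the relative drop in latency. The function names and the sample numbers are hypothetical and are not results from the paper.

```python
def retention_pct(student_score: float, teacher_score: float) -> float:
    """Percentage of the teacher's score that the distilled student preserves.

    Assumed definition (not taken from the paper): 100 * student / teacher.
    """
    if teacher_score <= 0:
        raise ValueError("teacher_score must be positive")
    return 100.0 * student_score / teacher_score


def time_reduction_pct(student_seconds: float, teacher_seconds: float) -> float:
    """Relative decrease in inference time, as a percentage of the teacher's time.

    Assumed definition (not taken from the paper): 100 * (teacher - student) / teacher.
    """
    if teacher_seconds <= 0:
        raise ValueError("teacher_seconds must be positive")
    return 100.0 * (teacher_seconds - student_seconds) / teacher_seconds


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the arithmetic.
    teacher_f1, student_f1 = 90.0, 88.2
    teacher_latency, student_latency = 2.0, 1.0

    print(f"retention: {retention_pct(student_f1, teacher_f1):.1f}%")
    print(f"time reduction: {time_reduction_pct(student_latency, teacher_latency):.1f}%")
```

Read this way, an "up to" figure would be the best value observed across the evaluated tasks and models, not an average.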

Abstract

As pretrained transformer language models continue to achieve state-of-the-art performance, the Natural Language Processing community has pushed for advances in model compression and efficient attention mechanisms to address high computational requirements and limited input sequence length. Despite these separate efforts, no investigation has been done into the intersection of these two fields. In this work, we provide an evaluation of model compression via knowledge distillation on efficient attention transformers. We provide cost-performance trade-offs for the compression of state-of-the-art efficient attention architectures and the gains made in performance in comparison to their full attention counterparts. Furthermore, we introduce a new long-context Named Entity Recognition dataset, GONERD, to train and test the performance of NER models on long sequences. We find that distilled…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management

Methods: Knowledge Distillation