General Cross-Architecture Distillation of Pretrained Language Models into Matrix Embeddings

Lukas Galke; Isabelle Cuber; Christoph Meyer; Henrik Ferdinand Nölscher; Angelina Sonderecker; Ansgar Scherp

arXiv:2109.08449·cs.CL·July 29, 2022

TL;DR

This paper introduces a novel method for distilling large pretrained language models into a more efficient matrix embedding architecture, achieving competitive performance with significantly fewer parameters and faster inference.

Contribution

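It presents a cross-architecture distillation approach that transfers a pretrained language model into a matrix-embedding model. It extends CMOW with a bidirectional component, per-token representations for general (task-agnostic) distillation, and a two-sequence encoding scheme for sentence-pair tasks.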

Findings

01

Results competitive with DistilBERT on question similarity and textual entailment.

02

Uses half the parameters of DistilBERT and is three times faster at inference.

03

Doubles the scores on linguistic acceptability compared to previous cross-architecture distillation approaches.

Abstract

Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretraining, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language…
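The core idea in the abstract, embedding each word as a matrix and encoding a sequence by multiplying those matrices in order, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the near-identity random initialization, the tiny dimension, and the way the backward direction is combined with the forward one are all assumptions made for illustration. It uses only the Python standard library and contrasts the order-sensitive matrix product with an order-insensitive sum (a CBOW-style baseline).

```python
import random


def identity(d):
    """Return a d x d identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def make_embeddings(vocab, d, seed=0):
    """Assign each word a d x d matrix (assumed init: identity plus small noise)."""
    rng = random.Random(seed)
    emb = {}
    for word in vocab:
        m = identity(d)
        for i in range(d):
            for j in range(d):
                m[i][j] += rng.gauss(0.0, 0.1)
        emb[word] = m
    return emb


def flatten(m):
    return [x for row in m for x in row]


def cmow_encode(tokens, emb, d):
    """Encode a sequence as the ordered product of its word matrices."""
    out = identity(d)
    for tok in tokens:
        out = matmul(out, emb[tok])
    return flatten(out)


def cbow_encode(tokens, emb, d):
    """Order-insensitive baseline: element-wise sum of the word matrices."""
    total = [0.0] * (d * d)
    for tok in tokens:
        total = [a + b for a, b in zip(total, flatten(emb[tok]))]
    return total


def bidirectional_encode(tokens, emb_fwd, emb_bwd, d):
    """Hypothetical bidirectional variant: concatenate a forward product with a
    product over the reversed sequence using a second embedding table."""
    forward = cmow_encode(tokens, emb_fwd, d)
    backward = cmow_encode(list(reversed(tokens)), emb_bwd, d)
    return forward + backward
```

Because matrix multiplication is not commutative, `cmow_encode(["dog", "bites", "man"], emb, d)` and `cmow_encode(["man", "bites", "dog"], emb, d)` generally differ, whereas `cbow_encode` returns the same vector for both orderings up to floating-point rounding.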

Code & Models

Repositories

lgalke/cross-architecture-distillation
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Bidirectional LSTM · Adam