Knowledge Distillation of Russian Language Models with Reduction of Vocabulary

Alina Kolesnikova; Yuri Kuratov; Vasily Konovalov; Mikhail Burtsev

arXiv:2205.02340·cs.CL·May 6, 2022

TL;DR

This paper introduces techniques for reducing the vocabulary size of Russian language models through knowledge distillation, achieving significant compression while maintaining performance on various benchmarks.

Contribution

The paper proposes two alignment techniques that make knowledge distillation possible for student models with a reduced vocabulary, where the teacher's and student's token sequences and output distributions no longer match.

Findings

01 Achieved 17x to 49x model compression

02 Maintained the quality of a 1.7x compressed student that keeps the full-sized vocabulary

03 Demonstrated effectiveness on multiple Russian NLP benchmarks

Abstract

Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one of the approaches to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings and hidden representations. An alternative is to reduce the number of tokens in the vocabulary and therefore the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models. As a result, KL-based knowledge distillation cannot be applied directly. We propose two simple yet effective alignment techniques that enable knowledge distillation to students with a reduced vocabulary.…
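The vocabulary mismatch described in the abstract can be illustrated with a minimal sketch of a temperature-scaled KL distillation loss. This is not the paper's method or its alignment techniques; the function names, the temperature value, and the toy logits are assumptions chosen for illustration. The sketch only shows that the loss is defined when teacher and student score the same set of output classes, and has no direct definition when the student's vocabulary is smaller.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def kd_kl_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over one output position.

    Hypothetical helper: both logit vectors must index the SAME vocabulary,
    otherwise the class-by-class comparison below is meaningless.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError(
            "teacher and student vocabularies differ in size; "
            "the output distributions must be aligned first"
        )
    log_p = log_softmax(teacher_logits, temperature)  # teacher distribution
    log_q = log_softmax(student_logits, temperature)  # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Same vocabulary (3 tokens): the loss is well defined.
same_vocab_loss = kd_kl_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])

# Reduced student vocabulary (2 tokens): direct KL distillation fails.
try:
    kd_kl_loss([1.0, 2.0, 3.0], [1.0, 2.0])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The size check covers only the output side of the problem. With a different tokenizer the student would also split the same text into a different token sequence, so positions would not line up either. That is the second mismatch the abstract mentions.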

Code & Models

Repositories

ayeffkay/rubert-tiny (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational Physics and Python Applications

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Residual Connection · Softmax · Multi-Head Attention · Label Smoothing · Adam