Knowledge Distillation of Russian Language Models with Reduction of Vocabulary
Alina Kolesnikova, Yuri Kuratov, Vasily Konovalov, Mikhail Burtsev

TL;DR
This paper introduces techniques for reducing the vocabulary size of Russian language models through knowledge distillation, achieving significant compression while maintaining performance on various benchmarks.
Contribution
It presents novel alignment methods enabling effective knowledge distillation with reduced vocabulary, addressing a key challenge in model compression.
Findings
Achieved 17x to 49x model compression
Matched the quality of a 1.7x compressed student that keeps the full vocabulary and only reduces the number of Transformer layers
Demonstrated effectiveness on multiple Russian NLP benchmarks
Abstract
Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one of the approaches to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings and hidden representations. An alternative option is to reduce the number of tokens in the vocabulary, and therefore the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models. As a result, it is impossible to directly apply KL-based knowledge distillation. We propose two simple yet effective alignment techniques that enable knowledge distillation to students with a reduced vocabulary.…
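The mismatch described in the abstract can be illustrated with a small sketch. This is not the authors' alignment method (the abstract is truncated before it is described). It only shows why a KL-based loss cannot be applied directly once the vocabularies differ, and how the embedding matrix shrinks with the vocabulary. The toy vocabularies, the logits, the vocabulary sizes (120,000 and 30,000), and the hidden size (768) are all illustrative assumptions, not figures from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the SAME ordered set of classes."""
    if len(p) != len(q):
        raise ValueError(
            f"class mismatch: teacher has {len(p)} classes, student has {len(q)}"
        )
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def embedding_params(vocab_size, hidden_size):
    """Number of parameters in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


# Hypothetical toy vocabularies: the student keeps only a subset of tokens.
teacher_vocab = ["[UNK]", "мо", "##дель", "модель", "язык"]
student_vocab = ["[UNK]", "мо", "##дель"]

# Input mismatch: the same word becomes sequences of different lengths.
teacher_tokens = ["модель"]        # one token in the larger vocabulary
student_tokens = ["мо", "##дель"]  # two tokens in the reduced vocabulary

# Output mismatch: the distributions have a different number of classes.
teacher_probs = softmax([0.1, 0.3, 0.2, 2.0, 1.0])  # 5 classes
student_probs = softmax([0.2, 1.5, 1.0])            # 3 classes

try:
    kl_divergence(teacher_probs, student_probs)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# With a shared vocabulary the same loss is well defined.
same_vocab_kl = kl_divergence(teacher_probs, softmax([0.0, 0.5, 0.1, 1.5, 1.2]))

# Illustrative sizes only (assumed, not taken from the paper).
full = embedding_params(120_000, 768)
reduced = embedding_params(30_000, 768)
print(mismatch_detected, len(teacher_tokens), len(student_tokens), full // reduced)
```

In this toy setup, both the token positions and the output classes would have to be aligned between teacher and student before a KL term could be computed. That is the gap the proposed alignment techniques are meant to close.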
Code & Models
- 🤗 DeepPavlov/distilrubert-base-cased-conversational (model, 2.1k downloads, 7 likes)
- 🤗 DeepPavlov/distilrubert-tiny-cased-conversational-v1 (model, 100 downloads, 3 likes)
- 🤗 DeepPavlov/distilrubert-tiny-cased-conversational (model, 378 downloads, 3 likes)
- 🤗 DeepPavlov/distilrubert-tiny-cased-conversational-5k (model, 19 downloads, 2 likes)
- 🤗 DeepPavlov/distilrubert-small-cased-conversational (model, 78 downloads, 3 likes)
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational Physics and Python Applications
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Residual Connection · Softmax · Multi-Head Attention · Label Smoothing · Adam
