Fine-tuning of Language Models with Discriminator
Vadim Popov, Mikhail Kudinov

TL;DR
This paper introduces a fine-tuning method for language models that adds a discriminator-estimated reverse Kullback-Leibler divergence to the cross-entropy loss, lowering perplexity on Penn Treebank from 52.4 to 52.1.
Contribution
It proposes a new fine-tuning approach using a discriminator to estimate divergence, enhancing language model quality with minimal hyperparameter tuning.
Findings
Perplexity on Penn Treebank decreases from 52.4 to 52.1
The method scales well to different architectures and datasets
The learning rate is the only hyperparameter that needs tuning
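To put the reported perplexity change in context, here is a minimal sketch of how perplexity relates to the mean per-token cross-entropy. The function names `perplexity` and `nats_per_token_gain` are illustrative and do not come from the paper; only the figures 52.4 and 52.1 are taken from the findings above.

```python
import math


def perplexity(token_probs):
    """Perplexity is exp of the mean negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


def nats_per_token_gain(ppl_before, ppl_after):
    """Drop in mean cross-entropy (nats per token) implied by a perplexity drop."""
    return math.log(ppl_before) - math.log(ppl_after)


# Figures reported for Penn Treebank in the findings above.
gain = nats_per_token_gain(52.4, 52.1)
print(f"cross-entropy reduction: {gain:.4f} nats/token")
```

Under this definition, going from 52.4 to 52.1 corresponds to a reduction of a little under 0.006 nats per token in mean cross-entropy.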
Abstract
Cross-entropy loss is a common choice for multiclass classification tasks, and for language modeling in particular. Minimizing this loss results in language models of very good quality. We show that it is possible to make these models perform even better by fine-tuning them with the sum of cross-entropy loss and reverse Kullback-Leibler divergence. The latter is estimated using a discriminator network that we train in advance. During fine-tuning, the probabilities of rare words, which are usually underestimated by language models, increase. The novel approach that we propose allows us to reach state-of-the-art quality on Penn Treebank: perplexity decreases from 52.4 to 52.1. Our fine-tuning algorithm is rather fast, scales well to different architectures and datasets, and requires almost no hyperparameter tuning: the only hyperparameter that needs to be tuned is…
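The combined objective described in the abstract can be sketched for a single next-word prediction as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the standard density-ratio identity, in which an optimal discriminator d(w) = p(w) / (p(w) + q(w)) gives log(p(w)/q(w)) = logit(d(w)), and it assumes the two terms are added with equal weight. The names `q` (model distribution), `d` (discriminator outputs), and all function names are hypothetical.

```python
import math


def cross_entropy(q, target):
    """Standard cross-entropy for one observed word index."""
    return -math.log(q[target])


def reverse_kl_from_discriminator(q, d):
    """Estimate KL(q || p) = sum_w q(w) * log(q(w) / p(w)).

    Assumption: d[w] is a discriminator's probability that word w came
    from real data rather than from the model, so that
    log(q(w) / p(w)) is approximated by -logit(d[w]).
    """
    total = 0.0
    for qw, dw in zip(q, d):
        if qw > 0.0:
            total += qw * -(math.log(dw) - math.log(1.0 - dw))
    return total


def fine_tuning_loss(q, target, d):
    """Sum of cross-entropy and the discriminator-based reverse KL estimate."""
    return cross_entropy(q, target) + reverse_kl_from_discriminator(q, d)


# Toy check: with an optimal discriminator the estimate matches the exact value.
p = [0.5, 0.3, 0.2]   # hypothetical data distribution
q = [0.7, 0.2, 0.1]   # hypothetical model distribution (rare words underestimated)
d_opt = [pw / (pw + qw) for pw, qw in zip(p, q)]
exact_rkl = sum(qw * math.log(qw / pw) for pw, qw in zip(p, q))
estimated_rkl = reverse_kl_from_discriminator(q, d_opt)
loss = fine_tuning_loss(q, target=2, d=d_opt)
```

In this toy setup the model underestimates the two rarer words, so the reverse KL term is positive and adds to the cross-entropy; in practice the discriminator is a trained network and its outputs only approximate the optimal ratio.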
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
