Language Modelling via Learning to Rank
Arvid Frydenlund, Gagandeep Singh, Frank Rudzicz

TL;DR
This paper frames language modelling as a ranking task and trains with rank-based knowledge distillation, which generally improves perplexity over KL-based distillation and can work without a pre-trained teacher.
Contribution
It introduces a ranking-based framework for language modelling and shows that simple N-gram teachers can be competitive with pre-trained teachers such as BERT and Born-Again models.
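To make the idea of an N-gram teacher concrete, here is a minimal sketch of how ranks could be produced from counts alone. It assumes a bigram (one-word context) count table and assumes the ground-truth word is always placed at rank 1; both choices, and the names `build_bigram_counts` and `top_k_ranks`, are illustrative and are not taken from the paper, whose actual construction is not described in this summary.

```python
from collections import Counter, defaultdict


def build_bigram_counts(tokens):
    """Count how often each word follows each one-word context."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def top_k_ranks(counts, context, gold, k=3):
    """Return up to k candidate next words for `context`.

    Assumption for this sketch: the ground-truth word `gold` is rank 1,
    and the remaining slots are filled with the most frequent observed
    continuations. Ties are broken alphabetically for determinism.
    """
    candidates = sorted(counts[context].items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = [gold] + [word for word, _ in candidates if word != gold]
    return ranked[:k]


corpus = "the cat sat on the mat while the cat ate and the dog sat".split()
counts = build_bigram_counts(corpus)
print(top_k_ranks(counts, "the", gold="mat", k=3))
```

The teacher here is non-probabilistic in the sense that only the order of the candidates is passed on; the raw counts are discarded once the ranking is built.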
Findings
Rank-based KD generally improves perplexity over KL-based KD, often with statistical significance.
N-gram teachers are competitive with complex models.
GPT-2 is the most effective teacher among tested models.
Abstract
We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-k ranks, we generate them using pre-trained LMs: GPT-2, BERT, and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method using N-grams to create a non-probabilistic teacher which generates the ranks without the need of a pre-trained LM. We confirm the hypotheses that we can treat LMing as a ranking task and that we can do so without the use of a pre-trained LM. We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback-Leibler-based KD. Surprisingly, given the simplicity of the method, N-grams act as competitive teachers…
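The contrast between KL-based and rank-based distillation can be sketched as follows. The rank-based loss shown is the Plackett-Luce negative log-likelihood, a standard listwise ranking loss chosen here for illustration; this summary does not state which ranking loss the authors use, so treat the function names and the choice of loss as assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_kd_loss(teacher_probs, student_logits):
    """KL(teacher || student): needs the teacher's full distribution."""
    log_p = log_softmax(student_logits)
    return sum(
        t * (math.log(t) - s) for t, s in zip(teacher_probs, log_p) if t > 0
    )


def plackett_luce_loss(ranked_ids, student_logits):
    """Negative log-likelihood of a top-k ordering (illustrative choice).

    Needs only the teacher's ordered word ids, not its probabilities.
    At each rank, the chosen word competes against every word that has
    not been placed yet.
    """
    remaining = set(range(len(student_logits)))
    loss = 0.0
    for idx in ranked_ids:
        pool = [student_logits[j] for j in sorted(remaining)]
        m = max(pool)
        lse = m + math.log(sum(math.exp(x - m) for x in pool))
        loss -= student_logits[idx] - lse
        remaining.remove(idx)
    return loss


student_logits = [2.0, 1.0, 0.0, -1.0]
agreeing = plackett_luce_loss([0, 1], student_logits)
disagreeing = plackett_luce_loss([3, 2], student_logits)
print(agreeing < disagreeing)
```

With a single ranked word (k = 1), this loss reduces to the usual cross-entropy on that word, which is one way to see ranking as a generalisation of predicting a single ground-truth word.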
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Knowledge Distillation · Byte Pair Encoding · Cosine Annealing · Softmax · Linear Warmup With Cosine Annealing
