Language Modelling via Learning to Rank

Arvid Frydenlund; Gagandeep Singh; Frank Rudzicz

arXiv:2110.06961·cs.CL·December 14, 2021

TL;DR

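This paper reframes language modelling as a ranking task: instead of predicting only the single ground-truth word, the model learns to rank a set of words that could continue a given context. The ranks come from a teacher, which yields a rank-based form of knowledge distillation (KD) that generally improves perplexity over KL-based KD and also works with an N-gram teacher in place of a pre-trained model.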

Contribution

It introduces a ranking-based framework for language modelling and shows that simple N-gram teachers are competitive with pre-trained teacher models.

Findings

01

Rank-based KD generally improves perplexity over KL-based KD, often with statistical significance.

02

N-gram teachers are competitive with complex models.

03

GPT-2 is the most effective teacher among tested models.

Abstract

We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-$k$ ranks, we generate them using pre-trained LMs: GPT-2, BERT, and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method using $N$-grams to create a non-probabilistic teacher which generates the ranks without the need of a pre-trained LM. We confirm the hypotheses that we can treat LMing as a ranking task and that we can do so without the use of a pre-trained LM. We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback-Leibler-based KD. Surprisingly, given the simplicity of the method, $N$-grams act as competitive teachers…
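To make the contrast between KL-based and rank-based KD concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which ranking loss the paper uses, so the sketch swaps in a Plackett-Luce (ListMLE-style) listwise loss as one common choice. The function names and the toy four-word vocabulary are hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of scores.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_kd_loss(teacher_logits, student_logits):
    # KL-based KD: KL(teacher || student) over the whole vocabulary.
    # The student has to match the teacher's probabilities.
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def top_k_ranking(teacher_scores, k):
    # Indices of the teacher's k best words, best first.
    # Only the order is kept, so the scores need not be probabilities.
    order = sorted(range(len(teacher_scores)), key=lambda i: -teacher_scores[i])
    return order[:k]

def plackett_luce_loss(student_logits, ranking):
    # ASSUMPTION: Plackett-Luce negative log-likelihood as the ranking loss.
    # At each step, the next-ranked word must beat all words not yet placed.
    remaining = list(range(len(student_logits)))
    loss = 0.0
    for idx in ranking:
        scores = [student_logits[j] for j in remaining]
        m = max(scores)
        lse = m + math.log(sum(math.exp(x - m) for x in scores))
        loss += lse - student_logits[idx]
        remaining.remove(idx)
    return loss

# Toy vocabulary of four words (hypothetical scores).
teacher = [2.0, 0.5, 1.0, -1.0]
ranking = top_k_ranking(teacher, k=2)  # [0, 2]

agree = plackett_luce_loss([3.0, 0.0, 1.5, -2.0], ranking)     # same order as teacher
disagree = plackett_luce_loss([0.0, 3.0, -2.0, 1.5], ranking)  # reversed preferences
print(agree < disagree)
```

With a ranking of length one, this loss reduces to ordinary cross-entropy on that word, which is the single ground-truth setup the abstract starts from. Because the loss consumes only an ordering, any teacher that can sort candidate words would fit this interface, including a non-probabilistic one.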

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Knowledge Distillation · Byte Pair Encoding · Cosine Annealing · Softmax · Linear Warmup With Cosine Annealing