Causal Distillation for Language Models

Zhengxuan Wu; Atticus Geiger; Josh Rozner; Elisa Kreiss; Hanson Lu; Thomas Icard; Christopher Potts; Noah D. Goodman

arXiv:2112.02505·cs.CL·June 7, 2022

TL;DR

This paper introduces causal distillation with interchange intervention training (IIT), a method that augments language model distillation by encouraging the student to imitate the teacher's causal computation process. Relative to standard distillation, it lowers perplexity on Wikipedia masked language modeling and improves results on GLUE, SQuAD, and CoNLL-2003.

Contribution

The paper proposes IIT as a third distillation objective, used alongside the task-specific and imitation objectives, that pushes the student to become a causal abstraction of the teacher and improves performance over standard distillation.

Findings

01. Lower perplexity on Wikipedia masked language modeling

02. Improved results on GLUE benchmark

03. Better performance on SQuAD and CoNLL-2003

Abstract

Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model, that is, a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT,…
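To make the idea of an interchange intervention concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy one-hidden-layer models, the unit alignment (`TEACHER_UNITS`, `STUDENT_UNITS`), and every function name are assumptions made for illustration. It shows only the forward computation: run each model on a source input, copy the aligned hidden units into a run on a base input, and compare the student's counterfactual output with the teacher's using cross-entropy.

```python
import math
import random


def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(model, x, swap=None):
    """Run a toy one-hidden-layer model (hypothetical stand-in for BERT).

    swap maps a hidden-unit index to a value that overwrites that unit
    before the output layer. Returns (hidden activations, output distribution).
    """
    hidden = [math.tanh(v) for v in linear(model["w_in"], x)]
    for index, value in (swap or {}).items():
        hidden[index] = value
    return hidden, softmax(linear(model["w_out"], hidden))


def interchange(model, base, source, units):
    """Process `base`, but with `units` set to the values they take on `source`."""
    source_hidden, _ = forward(model, source)
    swap = {i: source_hidden[i] for i in units}
    _, probs = forward(model, base, swap=swap)
    return probs


def cross_entropy(target, predicted):
    """Cross-entropy of `predicted` against the `target` distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


def random_model(rng, n_in, n_hidden, n_out):
    """Random weights; a real setup would load trained teacher/student models."""
    return {
        "w_in": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "w_out": [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)],
    }


rng = random.Random(0)
teacher = random_model(rng, n_in=3, n_hidden=4, n_out=3)  # larger model
student = random_model(rng, n_in=3, n_hidden=2, n_out=3)  # smaller model

# Assumed alignment: teacher units 0-1 correspond to student unit 0.
TEACHER_UNITS = [0, 1]
STUDENT_UNITS = [0]

base = [1.0, 0.0, -1.0]
source = [0.5, 2.0, 0.0]

# Apply the same (aligned) intervention to both models.
teacher_cf = interchange(teacher, base, source, TEACHER_UNITS)
student_cf = interchange(student, base, source, STUDENT_UNITS)

# The student is penalized when its counterfactual output departs from the teacher's.
iit_loss = cross_entropy(teacher_cf, student_cf)
print(f"IIT loss on this base/source pair: {iit_loss:.4f}")
```

In an actual training loop this loss would be minimized with respect to the student's weights in an automatic-differentiation framework, together with the task and imitation objectives; the sketch omits gradients entirely.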

Code & Models

Repositories

frankaging/Causal-Distill
PyTorch · Official

Taxonomy

Methods

Attention Is All You Need · Linear Layer · Attention Dropout · WordPiece · Weight Decay · Softmax · Residual Connection · Adam · Dropout