Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast-Choose Three

Steven Reich; David Mueller; Nicholas Andrews

arXiv:2010.06721·cs.LG·March 26, 2021

TL;DR

This paper studies ensemble distillation as a way to produce structured prediction models that are well calibrated, accurate, and fast at inference. In experiments on named-entity recognition (NER) and machine translation, the distilled single model retains much of the ensemble's performance and calibration, and occasionally improves on it.

Contribution

It presents ensemble distillation as a general framework for structured prediction that keeps much of the benefit of an ensemble while requiring only a single model at inference time.

Findings

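01. Distilled models retain much of the ensemble's performance and calibration benefits, and occasionally improve on them.

02. Distilled models are faster than the ensemble at inference because only one model is needed at test time.

03. The framework is effective on both NER and machine translation.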
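Calibration claims like the ones above are usually backed by a numeric measure. The sketch below shows one common choice, expected calibration error (ECE) with equal-width confidence bins. It is an illustration only: the function name, the binning scheme, and the default of 10 bins are assumptions, and this page does not state which calibration metric the paper reports.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (illustrative, not taken from the paper).

    confidences: predicted probability of the chosen label, each in [0, 1]
    correct:     booleans, True where the chosen label was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the data.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: the model is 95% confident but right only 3 times out of 4.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, True, False]
)
```

A lower value means that stated confidence tracks observed accuracy more closely. For structured outputs, the confidences could be per-token or per-sequence probabilities; which one is appropriate depends on the task.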

Abstract

Modern neural networks do not always produce well-calibrated predictions, even when trained with a proper scoring function such as cross-entropy. In classification settings, simple methods such as isotonic regression or temperature scaling may be used in conjunction with a held-out dataset to calibrate model outputs. However, extending these methods to structured prediction is not always straightforward or effective; furthermore, a held-out calibration set may not always be available. In this paper, we study ensemble distillation as a general framework for producing well-calibrated structured prediction models while avoiding the prohibitive inference-time cost of ensembles. We validate this framework on two tasks: named-entity recognition and machine translation. We find that, across both tasks, ensemble distillation produces models which retain much of, and occasionally improve upon,…
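To make the idea of ensemble distillation concrete, here is a minimal token-level sketch in plain Python. It assumes the student is trained to match the average of the teachers' per-token label distributions using cross-entropy. The function names, the uniform averaging, and the toy numbers are all hypothetical, and the paper's actual objectives (in particular for machine translation) may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_average(teacher_probs):
    """Uniformly average per-token distributions across teachers.

    teacher_probs[t][i][k] is teacher t's probability of label k at token i.
    """
    n_teachers = len(teacher_probs)
    n_tokens = len(teacher_probs[0])
    n_labels = len(teacher_probs[0][0])
    return [
        [sum(t[i][k] for t in teacher_probs) / n_teachers for k in range(n_labels)]
        for i in range(n_tokens)
    ]


def distillation_loss(student_logits, target_probs):
    """Mean per-token cross-entropy of the student against soft targets."""
    total = 0.0
    for logits, target in zip(student_logits, target_probs):
        probs = softmax(logits)
        total -= sum(q * math.log(p) for q, p in zip(target, probs) if q > 0)
    return total / len(student_logits)


# Toy example: two teachers, two tokens, three labels (made-up numbers).
teacher_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher_b = [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]]
targets = ensemble_average([teacher_a, teacher_b])

student_logits = [[2.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
loss = distillation_loss(student_logits, targets)
```

Because the targets are full distributions rather than one-hot labels, the student is pushed to reproduce the ensemble's uncertainty as well as its top prediction. That is the intuition for why calibration could carry over to a single model. At test time only the student is run, so the inference cost is that of one model.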
