Understanding BERT Rankers Under Distillation

Luyu Gao; Zhuyun Dai; Jamie Callan

arXiv:2007.11088 · cs.IR · July 23, 2020

TL;DR

This paper explores how to effectively distill BERT rankers to smaller, faster models without losing performance, enabling practical deployment in real-world search systems.

Contribution

It shows that the choice of distillation procedure is crucial: with a proper procedure, a distilled ranker runs up to nine times faster than the BERT ranker while preserving state-of-the-art ranking performance.

Findings

01

Up to nine times speedup achieved

02

Proper distillation preserves state-of-the-art performance

03

Effective knowledge transfer from BERT to smaller models

Abstract

Deep language models such as BERT, pre-trained on large corpora, have given a huge performance boost to state-of-the-art information retrieval ranking systems. Knowledge embedded in such models allows them to pick up complex matching signals between passages and queries. However, the high computation cost during inference limits their deployment in real-world search scenarios. In this paper, we study if and how the knowledge for search within BERT can be transferred to a smaller ranker through distillation. Our experiments demonstrate that it is crucial to use a proper distillation procedure, which produces up to nine times speedup while preserving the state-of-the-art performance.
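
To make the idea of transferring knowledge from a large ranker to a smaller one concrete, here is a minimal sketch of a generic soft-label distillation loss. It is not the procedure from the paper, which this page does not describe. It assumes each ranker emits two logits per query-passage pair (not relevant, relevant), and the names `distillation_loss`, `combined_loss`, `alpha`, and `temperature` are hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions.

    Assumption: generic soft-label distillation, not the paper's exact recipe.
    The temperature**2 factor keeps gradient magnitudes comparable across
    temperatures, a common convention.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def combined_loss(teacher_logits, student_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of hard-label cross-entropy and the soft distillation term.

    label: 0 for a non-relevant passage, 1 for a relevant one.
    alpha: hypothetical mixing weight; alpha=1.0 ignores the teacher entirely.
    """
    hard = -math.log(softmax(student_logits)[label])
    soft = distillation_loss(teacher_logits, student_logits, temperature)
    return alpha * hard + (1.0 - alpha) * soft


# One query-passage pair: the teacher is confident the passage is relevant,
# while the untrained student is undecided.
teacher = [-1.0, 3.0]
student = [0.0, 0.0]
print(combined_loss(teacher, student, label=1))
```

In this sketch the student is penalized both for missing the relevance label and for diverging from the teacher's softened scores, and the soft term vanishes when the two rankers agree exactly.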

Taxonomy

Methods: Linear Layer · Residual Connection · Layer Normalization · Adam · Multi-Head Attention · Attention Dropout · Dropout · WordPiece · Weight Decay · Linear Warmup With Linear Decay