Understanding BERT Rankers Under Distillation
Luyu Gao, Zhuyun Dai, Jamie Callan

TL;DR
This paper explores how to effectively distill BERT rankers to smaller, faster models without losing performance, enabling practical deployment in real-world search systems.
Contribution
It shows that the choice of distillation procedure is crucial: with a proper procedure, a smaller ranker runs up to nine times faster than the BERT ranker while preserving its state-of-the-art ranking performance.
Findings
Up to a ninefold inference speedup over the BERT ranker
Proper distillation preserves state-of-the-art performance
Effective knowledge transfer from BERT to smaller models
Abstract
Deep language models such as BERT pre-trained on large corpora have given a huge performance boost to the state-of-the-art information retrieval ranking systems. Knowledge embedded in such models allows them to pick up complex matching signals between passages and queries. However, the high computation cost during inference limits their deployment in real-world search scenarios. In this paper, we study if and how the knowledge for search within BERT can be transferred to a smaller ranker through distillation. Our experiments demonstrate that it is crucial to use a proper distillation procedure, which produces up to nine times speedup while preserving the state-of-the-art performance.
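To make the idea of transferring a teacher ranker's knowledge to a smaller student concrete, here is a minimal sketch of a generic soft-label distillation loss. It is an illustration only, not the procedure evaluated in the paper: the function names `softmax` and `distillation_loss`, the `temperature` parameter, and the two-class (relevant / not relevant) logits are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution, softened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL divergence KL(teacher || student) between softened distributions.

    Hypothetical helper: the student is trained to match the teacher's
    output distribution rather than only the hard relevance label.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Assumed example logits for one query-passage pair: [relevant, not relevant].
teacher = [2.0, -1.0]
student = [0.5, 0.2]
print(distillation_loss(teacher, student, temperature=2.0))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. A higher temperature flattens both distributions, so more of the teacher's relative confidence between the classes reaches the student.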
Taxonomy
Methods: Linear Layer · Residual Connection · Layer Normalization · Adam · Multi-Head Attention · Attention Dropout · Dropout · WordPiece · Weight Decay · Linear Warmup With Linear Decay
