An Empirical Study of Uniform-Architecture Knowledge Distillation in Document Ranking

Xubo Qin; Xiyuan Liu; Xiongfeng Zheng; Jie Liu; Yutao Zhu

arXiv:2302.04112·cs.IR·February 9, 2023

Open Access

TL;DR

This study investigates how different loss functions affect the effectiveness of uniform-architecture knowledge distillation for BERT-based document ranking models, highlighting the importance of pairwise loss in training smaller models.

Contribution

It provides empirical insights into the optimal distillation strategies for cross-encoder ranking models, emphasizing the role of pairwise loss functions.

Findings

1. Pairwise loss of hard labels is crucial for training student models.
2. Intermediate Transformer layer distillation may reduce performance.
3. Optimal distillation strategies differ from those in general NLP tasks.

Abstract

Although BERT-based ranking models are commonly used in commercial search engines, they are usually too slow for online ranking tasks. Knowledge distillation, which aims to train a smaller model with performance comparable to a larger one, is a common strategy for reducing online inference latency. In this paper, we investigate the effect of different loss functions on uniform-architecture distillation of BERT-based ranking models. Here "uniform-architecture" means that both the teacher and the student models use the cross-encoder architecture, while the student models include small-scale pre-trained language models. Our experimental results reveal that the optimal distillation configuration for ranking tasks differs considerably from that for general natural language processing tasks. Specifically, when the student models are in cross-encoder architecture, a pairwise loss of hard…
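To make the terminology in the abstract concrete, the sketch below shows one common way a pairwise hard-label loss can be combined with a soft-label term when a student cross-encoder is distilled from a teacher. It is an illustration under assumptions, not the paper's formulation: this excerpt does not give the exact losses, so the logistic pairwise form, the margin-based squared-error soft term, the weight `alpha`, and all function names are hypothetical choices. The scores are assumed to be scalar relevance scores that each cross-encoder assigns to a (query, document) pair.

```python
import math


def pairwise_hard_label_loss(pos_score, neg_score):
    """Logistic pairwise loss, -log(sigmoid(pos_score - neg_score)).

    "Hard label" here means only the ground-truth ordering is used:
    the relevant document should score higher than the non-relevant one.
    (Assumed form; the excerpt does not specify the exact pairwise loss.)
    """
    margin = pos_score - neg_score
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def soft_label_margin_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Squared error between student and teacher score margins.

    A swapped-in soft-label term for illustration only; the paper may
    use a different soft-label objective.
    """
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return (student_margin - teacher_margin) ** 2


def distillation_loss(student_pos, student_neg, teacher_pos, teacher_neg, alpha=0.5):
    """Weighted sum of the hard-label and soft-label terms.

    alpha is a hypothetical mixing weight: alpha=1.0 uses only the
    pairwise hard-label loss, alpha=0.0 only the soft-label term.
    """
    hard = pairwise_hard_label_loss(student_pos, student_neg)
    soft = soft_label_margin_loss(student_pos, student_neg, teacher_pos, teacher_neg)
    return alpha * hard + (1.0 - alpha) * soft


if __name__ == "__main__":
    # Student scores for a relevant and a non-relevant document,
    # alongside the teacher's scores for the same pair (made-up numbers).
    print(distillation_loss(2.0, 0.5, 4.0, 1.0, alpha=0.5))
```

In this sketch the hard-label term depends only on the student's own scores and the known relevance ordering, while the soft-label term pulls the student's score gap toward the teacher's. Intermediate-layer objectives, which the findings report may hurt performance, are deliberately left out.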

Taxonomy

Topics: Information Retrieval and Search Behavior · Expert finding and Q&A systems · Text and Document Classification Technologies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Adam · Label Smoothing · Softmax · Residual Connection