Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, Allan Hanbury

TL;DR
This paper introduces a cross-architecture knowledge distillation method with a margin-focused loss that makes efficient neural ranking models more effective, narrowing the effectiveness gap to larger teacher models without sacrificing efficiency.
Contribution
It proposes a novel Margin-MSE loss for distillation that accounts for score distribution differences across architectures, improving neural ranking performance.
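A minimal sketch of the idea, under the assumption that the loss takes the mean squared error between the student's and the teacher's score margins, where a margin is the score of a relevant passage minus the score of a non-relevant passage for the same query. The function name `margin_mse` and all scores below are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
from typing import Sequence


def margin_mse(
    student_pos: Sequence[float],
    student_neg: Sequence[float],
    teacher_pos: Sequence[float],
    teacher_neg: Sequence[float],
) -> float:
    """Mean squared error between student and teacher score margins.

    Assumption: one entry per (query, relevant, non-relevant) triple.
    """
    n = len(student_pos)
    if n == 0 or not (len(student_neg) == len(teacher_pos) == len(teacher_neg) == n):
        raise ValueError("all score lists must be non-empty and of equal length")
    total = 0.0
    for s_p, s_n, t_p, t_n in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        student_margin = s_p - s_n  # how far the student separates the pair
        teacher_margin = t_p - t_n  # how far the teacher separates the pair
        total += (student_margin - teacher_margin) ** 2
    return total / n


# Hypothetical scores for two training triples.
teacher_pos, teacher_neg = [9.0, 7.5], [4.0, 6.5]  # teacher margins: 5.0, 1.0
student_pos, student_neg = [1.5, 0.5], [0.5, 0.0]  # student margins: 1.0, 0.5

loss = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)

# Shifting every teacher score by a constant leaves the margins,
# and therefore the loss, unchanged.
shifted = margin_mse(
    student_pos,
    student_neg,
    [t + 100.0 for t in teacher_pos],
    [t + 100.0 for t in teacher_neg],
)
```

Because only score differences enter the loss, the student is free to place its scores in whatever range suits its own architecture, as long as it separates relevant from non-relevant passages by about the same amount as the teacher.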
Findings
Significant effectiveness gains across multiple architectures.
Improved retrieval performance with no efficiency loss.
Enhanced nearest neighbor retrieval with distillation.
Abstract
Retrieval and ranking models are the backbone of many applications such as web search, open domain QA, or text-based recommender systems. The latency of neural ranking models at query time is largely dependent on the architecture and deliberate choices by their designers to trade off effectiveness for higher efficiency. This focus on low query latency of a rising number of efficient ranking architectures makes them feasible for production deployment. In machine learning, an increasingly common approach to close the effectiveness gap of more efficient models is to apply knowledge distillation from a large teacher model to a smaller student model. We find that different ranking architectures tend to produce output scores in different magnitudes. Based on this finding, we propose a cross-architecture training procedure with a margin-focused loss (Margin-MSE) that adapts knowledge…
Code & Models
- 🤗 sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco (model · 16 downloads · 15 likes)
- 🤗 sebastian-hofstaetter/distilbert-cat-margin_mse-T2-msmarco (model · 2 downloads)
- 🤗 sebastian-hofstaetter/distilbert-dot-margin_mse-T2-msmarco (model · 58 downloads · 2 likes)
- 🤗 sebastian-hofstaetter/prettr-distilbert-split_at_3-margin_mse-T2-msmarco (model)
- 🤗 bobox/DeBERTaV3-small-GeneralSentenceTransformer-v2-checkpoints-tmp (model · 2 downloads)
- 🤗 hatemestinbejaia/mmarco-Arabic-mMiniLML-bi-encoder-NoKD-v1 (model · 3 downloads)
- 🤗 hatemestinbejaia/mmarco-Arabic-mMiniLML-bi-encoder-KD-v1 (model · 5 downloads)
- 🤗 hatemestinbejaia/mmarco-Arabic-AraDPR-bi-encoder-KD-v1 (model · 6 downloads)
- 🤗 hatemestinbejaia/mmarco-Arabic-AraDPR-bi-encoder-NoKD-v1 (model · 9 downloads)
- 🤗 hatemestinbejaia/mmarco-Arabic-AraElectra-bi-encoder-KD-v1 (model · 6 downloads)
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and ELM · Topic Modeling
Methods: Linear Layer · Knowledge Distillation · Dense Connections · Layer Normalization · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay
