LEAD: Liberal Feature-based Distillation for Dense Retrieval
Hao Sun, Xiao Liu, Yeyun Gong, Anlei Dong, Jingwen Lu, Yan Zhang, Linjun Yang, Rangan Majumder, Nan Duan

TL;DR
LEAD introduces a flexible feature-based distillation method that aligns intermediate layer distributions between teacher and student models, improving dense retrieval performance without constraints on vocabularies or architectures.
Contribution
The paper proposes LEAD, a novel, extendable, and architecture-agnostic feature-based distillation approach for dense retrieval models.
Findings
LEAD outperforms baseline methods on MS MARCO (Passage Ranking and Document Ranking) and TREC 2019 DL Track benchmarks.
It is effective across different model architectures and datasets.
LEAD is portable and does not require specific vocabularies or tokenizers.
Abstract
Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model. Traditional methods include response-based methods and feature-based methods. Response-based methods are widely used but suffer from lower upper limits of performance due to their ignorance of intermediate signals, while feature-based methods have constraints on vocabularies, tokenizers and model architectures. In this paper, we propose a liberal feature-based distillation method (LEAD). LEAD aligns the distribution between the intermediate layers of teacher model and student model, which is effective, extendable, portable and has no requirements on vocabularies, tokenizers, or model architectures. Extensive experiments show the effectiveness of LEAD on widely-used benchmarks, including MS MARCO Passage Ranking, TREC 2019 DL Track, MS MARCO Document Ranking and TREC…
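The abstract says LEAD aligns distributions between intermediate layers of the teacher and the student, without stating how those distributions are built. The sketch below is one plausible reading, not the authors' implementation. It assumes that each layer yields a pooled query vector and pooled passage vectors, that a layer's distribution is a softmax over query-passage dot products, and that alignment is measured with KL divergence. The function names, the layer pairing, and the pooling are all hypothetical. The sketch does show why such a loss would need no shared vocabulary, tokenizer, or hidden size: only distributions over the same candidate passages are compared.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_distribution(q, p):
    # q: (B, d) pooled query vectors from one layer (assumed pooling).
    # p: (B, N, d) pooled vectors for N candidate passages per query.
    # Returns a (B, N) distribution over candidates for each query.
    scores = np.einsum("bd,bnd->bn", q, p)
    return softmax(scores, axis=-1)

def kl_divergence(t, s, eps=1e-12):
    # Mean over queries of KL(teacher || student).
    return float(np.mean(np.sum(t * (np.log(t + eps) - np.log(s + eps)), axis=-1)))

def layerwise_distill_loss(teacher_layers, student_layers, layer_pairs):
    # teacher_layers / student_layers: lists of (q, p) tuples, one per layer.
    # layer_pairs: hypothetical (teacher_index, student_index) pairing.
    losses = []
    for ti, si in layer_pairs:
        t_dist = layer_distribution(*teacher_layers[ti])
        s_dist = layer_distribution(*student_layers[si])
        losses.append(kl_divergence(t_dist, s_dist))
    return sum(losses) / len(losses)

# Usage: teacher and student use different hidden sizes (16 vs 8),
# yet both produce distributions over the same 5 candidates.
rng = np.random.default_rng(0)
teacher = [(rng.normal(size=(2, 16)), rng.normal(size=(2, 5, 16))) for _ in range(4)]
student = [(rng.normal(size=(2, 8)), rng.normal(size=(2, 5, 8))) for _ in range(2)]
loss = layerwise_distill_loss(teacher, student, layer_pairs=[(1, 0), (3, 1)])
print(loss)
```

Pairing teacher layers 1 and 3 with student layers 0 and 1 is only an illustration; how layers are selected and weighted is not specified in the abstract.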
Taxonomy
Topics: Machine Learning and Data Classification · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Knowledge Distillation
