Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins
Lukas Gienapp, Niklas Deckers, Martin Potthast, Harrisen Scells

TL;DR
This paper introduces a self-supervised loss function for training bi-encoder retrieval models. It removes the need for a teacher model and for batch sampling, and matches the effectiveness of teacher distillation with only 13.5% of the data and a 3x to 15x speedup in training time.
Contribution
The authors propose a parameter-free self-distillation loss that uses the encoder's own pre-trained language modeling capabilities as the training signal. The loss performs implicit hard negative mining, which removes the need for batch sampling and simplifies training.
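To make the idea of a self-supplied margin target concrete, here is a minimal sketch in plain Python. The summary does not give the loss formula, so the form below is an assumption for illustration, not the authors' definition: the target margin between a positive and a negative document is taken to be one minus their cosine similarity, so a negative that closely resembles the positive (a hard negative) is asked for only a small score gap. The function names `cosine_similarity` and `adaptive_margin_loss` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def adaptive_margin_loss(query, doc_pos, doc_neg):
    """Squared error between the observed and the self-derived target margin.

    ASSUMPTION: the target margin is 1 - cos(doc_pos, doc_neg). This is an
    illustrative choice; the summary above does not state the exact loss.
    """
    # Margin the model currently produces between the two documents.
    observed = cosine_similarity(query, doc_pos) - cosine_similarity(query, doc_neg)
    # Margin derived from the encoder's own document embeddings (no teacher).
    target = 1.0 - cosine_similarity(doc_pos, doc_neg)
    return (observed - target) ** 2


# Toy embeddings: the query matches the positive and is orthogonal to the negative.
well_separated = adaptive_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Here the query matches the negative instead, so the loss is large.
mis_ranked = adaptive_margin_loss([0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

No teacher scores and no tunable margin hyperparameter appear in this sketch; the only inputs are the three embeddings produced by the encoder being trained.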
Findings
Self-distillation matches teacher distillation effectiveness with less data.
Training speed improves by 3x to 15x over traditional methods.
The approach requires only 13.5% of the data used in previous methods.
Abstract
Representation-based retrieval models, so-called bi-encoders, estimate the relevance of a document to a query by calculating the similarity of their respective embeddings. Current state-of-the-art bi-encoders are trained using an expensive training regime involving knowledge distillation from a teacher model and batch-sampling. Instead of relying on a teacher model, we contribute a novel parameter-free loss function for self-supervision that exploits the pre-trained language modeling capabilities of the encoder model as a training signal, eliminating the need for batch sampling by performing implicit hard negative mining. We investigate the capabilities of our proposed approach through extensive experiments, demonstrating that self-distillation can match the effectiveness of teacher distillation using only 13.5% of the data, while offering a speedup in training time between 3x and 15x…
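The bi-encoder scoring described at the start of the abstract can be sketched as follows. This is a toy illustration under stated assumptions: the embeddings are hand-written vectors standing in for encoder outputs, cosine similarity is assumed as the similarity function, and `rank_documents` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_emb, doc_embs):
    """Score every document against the query and sort by descending score."""
    scores = {doc_id: cosine_similarity(query_emb, emb)
              for doc_id, emb in doc_embs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hand-written stand-ins for encoder outputs (assumed values, not real embeddings).
query_embedding = [1.0, 0.0, 1.0]
document_embeddings = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partial overlap with the query
}

ranking = rank_documents(query_embedding, document_embeddings)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because queries and documents are embedded independently, document embeddings can be computed once ahead of time and only the query needs encoding at search time.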
Taxonomy
Methods: Knowledge Distillation
