Training on the Test Model: Contamination in Ranking Distillation
Vishakha Suresh Kalal, Andrew Parry, Sean MacAvaney

TL;DR
This paper investigates how data contamination in a teacher model affects knowledge distillation for ranking tasks, highlighting the risks of distilling from opaque models.
Contribution
It demonstrates that contamination can occur even when test data makes up only a small fraction of the teacher's training data, and it urges caution when distilling from black-box models.
Findings
Contamination occurs even with small test data fractions.
Distillation techniques are susceptible to data contamination.
Caution is advised when using opaque teacher models.
Abstract
Neural approaches to ranking based on pre-trained language models are highly effective in ad-hoc search. However, the computational expense of these models can limit their application. As such, a process known as knowledge distillation is frequently applied to allow a smaller, efficient model to learn from an effective but expensive model. A key example of this is the distillation of expensive API-based commercial Large Language Models into smaller production-ready models. However, due to the opacity of training data and processes of most commercial models, one cannot ensure that a chosen test collection has not been observed previously, creating the potential for inadvertent data contamination. We, therefore, investigate the effect of a contaminated teacher model in a distillation setting. We evaluate several distillation techniques to assess the degree to which contamination occurs…
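To make the distillation setting concrete, here is a minimal sketch of one commonly used ranking distillation objective, a margin-based mean squared error loss. The abstract does not name the techniques the authors evaluate, so this choice is an assumption made for illustration, and the function name `margin_mse` and its parameters are hypothetical. The student is trained so that its score gap between a relevant and a non-relevant document matches the teacher's gap for the same query. If the teacher has already seen the test collection, its margins on those queries carry that information into the student.

```python
from typing import Sequence


def margin_mse(
    student_pos: Sequence[float],
    student_neg: Sequence[float],
    teacher_pos: Sequence[float],
    teacher_neg: Sequence[float],
) -> float:
    """Mean squared error between student and teacher score margins.

    Each index is one (query, relevant doc, non-relevant doc) triple.
    This is an illustrative loss, not necessarily one used in the paper.
    """
    n = len(student_pos)
    if n == 0:
        raise ValueError("at least one training triple is required")
    if not (len(student_neg) == len(teacher_pos) == len(teacher_neg) == n):
        raise ValueError("all score lists must have the same length")

    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        student_margin = sp - sn  # how strongly the student prefers the relevant doc
        teacher_margin = tp - tn  # how strongly the teacher prefers it
        total += (student_margin - teacher_margin) ** 2
    return total / n


# Hypothetical scores for a single triple: student margin 1.0, teacher margin 2.0.
loss = margin_mse([2.0], [1.0], [5.0], [3.0])
```

Because only margins are compared, the student does not need to reproduce the teacher's absolute score scale, which is useful when the teacher is an API-based model with uncalibrated outputs.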
Taxonomy
Topics: Machine Learning and Algorithms
Methods: Knowledge Distillation
