Training on the Test Model: Contamination in Ranking Distillation

Vishakha Suresh Kalal; Andrew Parry; Sean MacAvaney

arXiv:2411.02284·cs.IR·November 5, 2024

TL;DR

This paper investigates how data contamination in a teacher model affects knowledge distillation for ranking tasks, highlighting the risks of distilling from opaque models whose training data cannot be inspected.

Contribution

The paper demonstrates that contamination can occur even when only a small fraction of test data is involved, and it urges caution when distilling from black-box models.

Findings

01 Contamination occurs even with small test data fractions.

02 Distillation techniques are susceptible to data contamination.

03 Caution is advised when using opaque teacher models.
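
To make the idea of a "small test data fraction" concrete, the following is a minimal sketch of how a contaminated training set could be simulated by leaking a chosen fraction of test queries into otherwise clean training data. The function name `contaminate`, its parameters, and the definition of `fraction` as a share of the test set are all hypothetical and chosen for illustration; this is not the authors' experimental protocol.

```python
import random
from typing import List, Sequence, Tuple


def contaminate(
    train_queries: Sequence[str],
    test_queries: Sequence[str],
    fraction: float,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Mix a fraction of the test queries into the training queries.

    Assumption: `fraction` is the share of the test set that leaks,
    e.g. 0.1 leaks 10% of the test queries. Hypothetical helper.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the simulated leak is reproducible
    k = round(fraction * len(test_queries))
    leaked = rng.sample(list(test_queries), k)
    mixed = list(train_queries) + leaked
    rng.shuffle(mixed)
    return mixed, leaked


mixed, leaked = contaminate(["a", "b", "c"], ["t1", "t2", "t3", "t4"], 0.5, seed=1)
print(len(mixed), len(leaked))
```

A teacher trained on `mixed` rather than on the clean training queries would then stand in for a model with undisclosed exposure to the test collection.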

Abstract

Neural approaches to ranking based on pre-trained language models are highly effective in ad-hoc search. However, the computational expense of these models can limit their application. As such, a process known as knowledge distillation is frequently applied to allow a smaller, efficient model to learn from an effective but expensive model. A key example of this is the distillation of expensive API-based commercial Large Language Models into smaller production-ready models. However, due to the opacity of training data and processes of most commercial models, one cannot ensure that a chosen test collection has not been observed previously, creating the potential for inadvertent data contamination. We, therefore, investigate the effect of a contaminated teacher model in a distillation setting. We evaluate several distillation techniques to assess the degree to which contamination occurs…
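
As an illustration of what distillation means for a ranker, here is a minimal sketch of one widely used objective, Margin-MSE, in which the student is trained to reproduce the teacher's score margin between a relevant and a non-relevant document. The abstract does not say which distillation techniques the paper evaluates, so the choice of Margin-MSE and the function name `margin_mse` are assumptions made for illustration only.

```python
from typing import Sequence


def margin_mse(
    student_pos: Sequence[float],
    student_neg: Sequence[float],
    teacher_pos: Sequence[float],
    teacher_neg: Sequence[float],
) -> float:
    """Mean squared error between student and teacher score margins.

    Each index i is one (query, positive doc, negative doc) triple.
    Hypothetical helper for illustration, not the paper's code.
    """
    n = len(student_pos)
    if n == 0 or not (len(student_neg) == len(teacher_pos) == len(teacher_neg) == n):
        raise ValueError("all score lists must be non-empty and of equal length")
    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        # The student is penalised for deviating from the teacher's margin,
        # so it inherits whatever preferences the teacher expresses.
        total += ((sp - sn) - (tp - tn)) ** 2
    return total / n


# Student margin is 1.0, teacher margin is 2.0, so the loss is (1 - 2)^2.
print(margin_mse([2.0], [1.0], [3.0], [1.0]))  # 1.0
```

Because the student's only training signal here is the teacher's scores, any preference the teacher acquired from seeing test data can plausibly be passed on, which is the setting the abstract sets out to investigate.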

Code & Models

Repositories

Parry-Parry/ContaminatedDistillation
none · Official

Taxonomy

Topics: Machine Learning and Algorithms

Methods: Knowledge Distillation