Concerned with Data Contamination? Assessing Countermeasures in Code Language Model

Jialun Cao; Wuqi Zhang; Shing-Chi Cheung

arXiv:2403.16898·cs.SE·March 29, 2024·3 cites

Open Access

TL;DR

This study systematically evaluates countermeasures against data contamination in code language model evaluations. It finds that models sometimes perform better on post-cut-off or refactored data, and that perplexity metrics cannot reliably detect contamination.

Contribution

It provides a comprehensive analysis of countermeasures' impacts on CLMs' performance using a large, timestamped dataset, highlighting unexpected findings and metric limitations.

Findings

01

CLMs sometimes perform better on post-cut-off data.

02

Refactoring can improve model performance instead of degrading it.

03

Perplexity metrics cannot reliably detect data contamination.
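For context on the third finding, perplexity is the exponential of the average negative log-likelihood a model assigns to a token sequence. The sketch below shows only this standard definition; it is not the paper's measurement pipeline, and the function name `perplexity` and the input format (a list of natural-log token probabilities) are assumptions made for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the mean negative log-likelihood of a token sequence.

    `token_logprobs` is a hypothetical input: one natural-log probability
    per token, as a language model would assign to a code sample.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

A lower value means the model found the sample more predictable. The finding above is that this signal alone does not reliably separate contaminated from uncontaminated data.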

Abstract

Various techniques have been proposed to leverage the capabilities of code language models (CLMs) for software engineering (SE) tasks. These techniques are typically evaluated on publicly available datasets, so the evaluation is exposed to data contamination: the evaluation datasets may already have been used to train the CLMs under study. This can significantly affect the reliability of the evaluation. Several countermeasures have been suggested to mitigate the threat, including using more recent data, curating new data, and refactoring existing data, yet it is unclear whether they actually mitigate data contamination threats to model evaluation. To fill this gap, we conduct a systematic study that quantifies the impact of these countermeasures on CLMs' performance. To facilitate the study, we collected over 2…
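The "more recent data" countermeasure amounts to partitioning timestamped samples around a model's training cut-off date. The following is a minimal sketch under stated assumptions: the `Sample` type, its fields, and `split_by_cutoff` are hypothetical names introduced here, not the authors' tooling or dataset schema.

```python
from datetime import date
from typing import List, NamedTuple, Sequence, Tuple


class Sample(NamedTuple):
    """Hypothetical record: an identifier plus the date the code was created."""
    name: str
    created: date


def split_by_cutoff(
    samples: Sequence[Sample], cutoff: date
) -> Tuple[List[Sample], List[Sample]]:
    """Partition samples into (possibly seen, created after the cut-off)."""
    possibly_seen = [s for s in samples if s.created <= cutoff]
    post_cutoff = [s for s in samples if s.created > cutoff]
    return possibly_seen, post_cutoff
```

Comparing model performance on the two partitions is one way to probe for contamination, although the first finding above indicates that post-cut-off data does not always yield lower scores.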

Taxonomy

Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research