Concerned with Data Contamination? Assessing Countermeasures in Code Language Model
Jialun Cao, Wuqi Zhang, Shing-Chi Cheung

TL;DR
This study systematically evaluates how well common countermeasures against data contamination work in evaluations of code language models (CLMs), and reports counterintuitive results on model performance and on the limits of contamination-detection metrics.
Contribution
It provides a comprehensive analysis of countermeasures' impacts on CLMs' performance using a large, timestamped dataset, highlighting unexpected findings and metric limitations.
Findings
CLMs sometimes perform better on post-cut-off data.
Refactoring can improve model performance instead of degrading it.
Perplexity metrics cannot reliably detect data contamination.
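To make the last finding concrete, here is a minimal sketch of how perplexity is commonly computed from per-token log-probabilities. It is an illustration of the metric in general, not the paper's measurement setup: the function name `perplexity` and the sample log-probabilities are hypothetical, and natural-log probabilities are assumed.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the negative mean natural-log probability per token.

    Assumption: `token_logprobs` holds natural-log probabilities that some
    model assigned to each token of one code sample (hypothetical input).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical values: a model that gives every token probability 0.25
# has a perplexity of about 4 (it is "choosing among four options").
uniform_sample = [math.log(0.25)] * 4
# A sample the model predicts more confidently gets a lower perplexity.
confident_sample = [math.log(0.9)] * 4

print(perplexity(uniform_sample))
print(perplexity(confident_sample))
```

A low score only says the model finds the text predictable. Predictable code is not necessarily memorized code, which is consistent with the finding that perplexity alone cannot reliably flag contamination.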
Abstract
Various techniques have been proposed to leverage the capabilities of code language models (CLMs) for software engineering (SE) tasks. While these techniques are typically evaluated on publicly available datasets, the evaluation can be subject to data contamination threats, where the evaluation datasets have already been used to train the CLMs concerned. This can significantly affect the reliability of the evaluation. Countermeasures such as using more recent data, curating new data, and refactoring existing data have been suggested, yet it is unclear whether they actually mitigate data contamination threats to model evaluation. To fill this gap, we conduct a systematic study to quantify the impacts of these countermeasures on CLMs' performance. To facilitate the study, we collected over 2…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research
