Rethinking the effects of data contamination in Code Intelligence
Zhen Yang, Hongyi Lin, Yifan He, Junqi Wang, Zeyu Sun, Shuo Liu, Jie Xu, Pengpeng Wang, Zhongxing Yu, Qingyuan Liang

TL;DR
This study systematically investigates the impact of various data contamination scenarios on code intelligence models, revealing that contamination effects vary with model type and training paradigm, challenging previous assumptions.
Contribution
It introduces a comprehensive empirical analysis of fine-grained contamination effects across multiple models, tasks, and programming languages, filling a gap in prior research.
Findings
Paired contamination does not lead to significant performance overestimation in PLMs.
LLMs are highly affected by paired contamination during inference.
Other contamination scenarios have negligible impact on model performance.
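The findings above separate paired contamination from the other scenarios. As a rough illustration of what that distinction could mean in practice, the sketch below labels a single (input, output) test sample by whether its two sides appear in a training corpus together, separately, or not at all. The label names, the function names, and the exact-match-after-whitespace-normalization criterion are all assumptions made for this example; this summary does not spell out the paper's own taxonomy or detection procedure.

```python
from typing import Iterable, Set, Tuple


def normalize(text: str) -> str:
    """Collapse whitespace so trivially reformatted copies still match."""
    return " ".join(text.split())


def contamination_label(
    test_pair: Tuple[str, str],
    train_pairs: Iterable[Tuple[str, str]],
) -> str:
    """Label one (input, output) test sample against a training corpus.

    The labels are hypothetical names chosen for this sketch:
      "paired"              - the exact (input, output) pair occurs in training
      "both-sides-unpaired" - both sides occur, but never together
      "input-only"          - only the input side occurs
      "output-only"         - only the output side occurs
      "clean"               - neither side occurs
    """
    test_in, test_out = normalize(test_pair[0]), normalize(test_pair[1])
    seen_inputs: Set[str] = set()
    seen_outputs: Set[str] = set()
    seen_pairs: Set[Tuple[str, str]] = set()
    for src, tgt in train_pairs:
        src, tgt = normalize(src), normalize(tgt)
        seen_inputs.add(src)
        seen_outputs.add(tgt)
        seen_pairs.add((src, tgt))

    if (test_in, test_out) in seen_pairs:
        return "paired"
    in_hit = test_in in seen_inputs
    out_hit = test_out in seen_outputs
    if in_hit and out_hit:
        return "both-sides-unpaired"
    if in_hit:
        return "input-only"
    if out_hit:
        return "output-only"
    return "clean"
```

For a code summarization setup, `train_pairs` would hold (code, summary) tuples. Exact matching is the simplest possible criterion; a real contamination audit would likely need fuzzier matching, which this sketch does not attempt.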
Abstract
In recent years, code intelligence has gained increasing importance in the field of automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns about data contamination and its potential impact on model performance evaluation. Previous studies mainly focused on sample-level contamination, ignoring the partial contamination scenarios that are pervasive in code intelligence. This paper fills this gap with a systematic empirical study of fine-grained data contamination on mainstream code tasks. Our study involves representative PLMs (RoBERTa and GPT-2) and LLMs (LLaMA and StarCoder), covering three major tasks (code translation, code generation, and code summarization) across two Programming Languages (PLs), Java and Python. We categorize contamination scenarios…
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Software Engineering Techniques and Practices
Methods: Linear Layer · WordPiece · Weight Decay · Cosine Annealing · Multi-Head Attention · Attention Is All You Need · Discriminative Fine-Tuning · Linear Warmup With Linear Decay · Dropout · Residual Connection
