CODECLEANER: Elevating Standards with A Robust Data Contamination Mitigation Toolkit
Jialun Cao, Songqiang Chen, Wuqi Zhang, Hau Ching Lo, and Shing-Chi Cheung

TL;DR
This paper introduces CODECLEANER, an open-source toolkit of code refactoring operators that significantly reduce data contamination in code language model (CLM) evaluations, making evaluation results more reliable and supporting industrial adoption.
Contribution
It presents the first systematic study of code refactoring operators' effectiveness across multiple scales and languages, along with an open-source toolkit for contamination mitigation.
Findings
65% reduction in data overlap ratio when all operators are applied
Effective in Python and generalizable to Java
Facilitates more reliable CLM performance evaluation
Abstract
Data contamination presents a critical barrier preventing widespread industrial adoption of advanced software engineering techniques that leverage code language models (CLMs). This phenomenon occurs when evaluation data inadvertently overlaps with the public code repositories used to train CLMs, severely undermining the credibility of performance evaluations. For software companies considering the integration of CLM-based techniques into their development pipeline, this uncertainty about true performance metrics poses an unacceptable business risk. Code refactoring, which comprises code restructuring and variable renaming, has emerged as a promising measure to mitigate data contamination. It provides a practical alternative to the resource-intensive process of building contamination-free evaluation datasets, which would require companies to collect, clean, and label code created after…
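To make the idea of refactoring as a contamination countermeasure concrete, here is a minimal sketch of one kind of operator the abstract mentions, variable renaming, together with a simple token n-gram overlap measure. Everything in it is an assumption for illustration: `rename_locals`, `ngram_overlap`, the `var_N` naming scheme, and the trigram-based ratio are hypothetical and are not CODECLEANER's actual operators or its overlap metric. It requires Python 3.9+ for `ast.unparse`.

```python
import ast
import io
import tokenize


def rename_locals(source):
    """Rename parameters and assigned names to var_0, var_1, ... (illustrative only).

    Limitations of this sketch: function names, attributes, and keyword
    arguments at call sites are left untouched, and comments are dropped
    by ast.unparse.
    """
    tree = ast.parse(source)
    mapping = {}
    # Pass 1: collect every parameter name and every name that is assigned to.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            mapping.setdefault(node.arg, f"var_{len(mapping)}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            mapping.setdefault(node.id, f"var_{len(mapping)}")
    # Pass 2: rewrite all occurrences of the collected names.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg) and node.arg in mapping:
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)


def _tokens(source):
    """Lexical tokens of a Python snippet, ignoring layout and comments."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}
    reader = io.StringIO(source).readline
    return [t.string for t in tokenize.generate_tokens(reader) if t.type not in skip]


def ngram_overlap(candidate, reference, n=3):
    """Fraction of the candidate's distinct token n-grams that also occur in the reference."""
    def grams(src):
        toks = _tokens(src)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    cand = grams(candidate)
    if not cand:
        return 0.0
    return len(cand & grams(reference)) / len(cand)


# Hypothetical evaluation sample that is assumed to also appear in training data.
original = '''
def total_price(items, tax):
    subtotal = 0
    for item in items:
        subtotal += item
    return subtotal * (1 + tax)
'''

refactored = rename_locals(original)
overlap_before = ngram_overlap(original, original)
overlap_after = ngram_overlap(refactored, original)
print(refactored)
print(f"overlap before: {overlap_before:.2f}, after: {overlap_after:.2f}")
```

The refactored function computes the same result as the original while sharing fewer token trigrams with it, which is the property a contamination-mitigation operator aims for. A real toolkit would also need to handle scoping, keyword arguments, and structural rewrites that this sketch ignores.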
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Electrostatic Discharge in Electronics · Cloud Data Security Solutions
