Beyond Natural Language Perplexity: Detecting Dead Code Poisoning in Code Generation Datasets
Chi-Chien Tsai, Chia-Mu Yu, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

TL;DR
This paper introduces DePA, a line-level perplexity analysis method that effectively detects and cleanses dead code poisoning in code datasets, significantly improving detection accuracy and speed over existing techniques.
Contribution
DePA is a novel line-level detection method that leverages code structure and context to identify dead code poisoning, outperforming existing perplexity-based approaches.
Findings
DePA achieves a detection F1-score 0.14-0.19 higher than existing methods.
DePA increases poisoned segment localization precision by 44-65%.
DePA improves detection speed by up to 23 times.
Abstract
The increasing adoption of large language models (LLMs) for code-related tasks has raised concerns about the security of their training datasets. One critical threat is dead code poisoning, where syntactically valid but functionally redundant code is injected into training data to manipulate model behavior. Such attacks can degrade the performance of neural code search systems, leading to biased or insecure code suggestions. Existing detection methods, such as token-level perplexity analysis, fail to effectively identify dead code due to the structural and contextual characteristics of programming languages. In this paper, we propose DePA (Dead Code Perplexity Analysis), a novel line-level detection and cleansing method tailored to the structural properties of code. DePA computes line-level perplexity by leveraging the contextual relationships between code lines and identifies anomalous…
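To make the idea of line-level perplexity concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy add-one-smoothed unigram model in place of a code LLM, scores each line in isolation (DePA, per the abstract, also uses the context between lines), and flags lines whose perplexity exceeds a multiple of the snippet's median. The names `UnigramModel`, `line_perplexity`, `flag_lines`, and `cleanse`, as well as the `ratio` threshold, are hypothetical and chosen only for illustration.

```python
import math
import re
import statistics
from collections import Counter


def tokenize(line):
    # Crude tokenizer: identifiers, integers, or any single non-space symbol.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", line)


class UnigramModel:
    """Toy stand-in for a code language model (an assumption, not DePA's model)."""

    def __init__(self, corpus_lines):
        self.counts = Counter(t for line in corpus_lines for t in tokenize(line))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen tokens

    def logprob(self, token):
        # Add-one smoothing so unseen tokens get a small nonzero probability.
        return math.log((self.counts[token] + 1) / (self.total + self.vocab))


def line_perplexity(model, line):
    # Perplexity of one line: exp of the mean negative log-probability of its tokens.
    tokens = tokenize(line)
    if not tokens:
        return None  # blank lines are not scored
    nll = -sum(model.logprob(t) for t in tokens) / len(tokens)
    return math.exp(nll)


def flag_lines(model, snippet, ratio=2.0):
    # Return indices of lines whose perplexity exceeds ratio * median perplexity.
    # The median-based rule is an illustrative choice, not the paper's criterion.
    scored = []
    for i, line in enumerate(snippet.splitlines()):
        ppl = line_perplexity(model, line)
        if ppl is not None:
            scored.append((i, ppl))
    if not scored:
        return []
    median = statistics.median(p for _, p in scored)
    return [i for i, p in scored if p > ratio * median]


def cleanse(model, snippet, ratio=2.0):
    # Drop the flagged lines and return the remaining code.
    flagged = set(flag_lines(model, snippet, ratio))
    kept = [l for i, l in enumerate(snippet.splitlines()) if i not in flagged]
    return "\n".join(kept)


if __name__ == "__main__":
    clean_corpus = [
        "def add(a, b):",
        "    return a + b",
        "def sub(a, b):",
        "    return a - b",
    ]
    model = UnigramModel(clean_corpus)
    poisoned = "def add(a, b):\n    while 0 > 1: zzz = 9\n    return a + b"
    print(flag_lines(model, poisoned))
    print(cleanse(model, poisoned))
```

In this toy setup the injected `while 0 > 1: zzz = 9` line consists almost entirely of tokens the model has never seen, so its perplexity stands well above that of its neighbours. A real detector would replace the unigram model with a code LLM conditioned on the surrounding lines and would calibrate the threshold on held-out data.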
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Security and Verification in Computing
