Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis
Scott Thornton

TL;DR
This large-scale empirical study investigates whether adversarial code comments can mislead AI models during vulnerability detection, finding minimal impact and identifying effective defenses such as static analysis cross-referencing.
Contribution
The paper provides the first comprehensive evaluation of comment-based adversarial attacks against LLMs in code review, demonstrating their limited effectiveness and proposing robust automated defenses.
Findings
Adversarial comments have a negligible, statistically non-significant effect on detection accuracy.
Static analysis cross-referencing significantly improves detection rates.
Complex attack strategies do not outperform simple comment manipulations.
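The static analysis cross-referencing defense can be pictured as follows. This is a minimal sketch, not the paper's implementation: the `Finding` and `LlmVerdict` records and the `cross_reference` helper are hypothetical. It assumes the analyzer reports findings per file and line, and that the LLM reports a per-file verdict plus the lines it flagged. The analyzer reads code rather than comments, so its findings act as a check that a persuasive comment cannot argue away.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical static-analysis finding (one flagged line in one file)."""
    file: str
    line: int
    rule: str


@dataclass(frozen=True)
class LlmVerdict:
    """Hypothetical per-file verdict produced by an LLM reviewer."""
    file: str
    vulnerable: bool
    flagged_lines: frozenset[int]


def cross_reference(verdict: LlmVerdict, findings: list[Finding]) -> dict:
    """Combine an LLM verdict with static findings for the same file.

    Any static finding forces the combined verdict to 'vulnerable', and
    static findings the LLM did not flag are reported as disagreements.
    A 'safe' LLM verdict that contradicts the analyzer is escalated.
    """
    relevant = [f for f in findings if f.file == verdict.file]
    missed = [f for f in relevant if f.line not in verdict.flagged_lines]
    return {
        "file": verdict.file,
        "vulnerable": verdict.vulnerable or bool(relevant),
        "missed_by_llm": missed,
        # Escalate only when the LLM called the file safe despite findings.
        "needs_human_review": bool(missed) and not verdict.vulnerable,
    }
```

For example, if an authority-spoofing comment convinces the model that `app.py` is safe while the analyzer flags a SQL injection sink on line 42, the combined result still marks the file vulnerable and routes it to a human. Treating the analyzer as an override rather than a vote is a design assumption made here for illustration.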
Abstract
AI-assisted code review is widely used to detect vulnerabilities before production release. Prior work shows that adversarial prompt manipulation can degrade large language model (LLM) performance in code generation. We test whether similar comment-based manipulation misleads LLMs during vulnerability detection. We build a benchmark of 100 code samples across Python, JavaScript, and Java, each paired with eight comment variants ranging from no comments to adversarial strategies such as authority spoofing and technical deception. Eight frontier models, five commercial and three open-source, are evaluated in 9,366 trials. Adversarial comments produce small, statistically non-significant effects on detection accuracy (McNemar exact p > 0.21; all 95 percent confidence intervals include zero). This holds for commercial models with 89 to 96 percent baseline detection and open-source models with 53 to…
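The significance claim rests on paired comparisons: the same sample is reviewed with and without an adversarial comment, so only discordant pairs (detected in one condition, missed in the other) carry evidence of an effect. The sketch below shows an exact McNemar test on such paired outcomes using only the standard library. The function name, the boolean-list input layout, and the Wald interval for the change in detection rate are assumptions for illustration; this excerpt does not state which confidence-interval procedure the paper uses.

```python
from math import comb, sqrt


def mcnemar_exact(baseline: list[bool], attacked: list[bool]) -> dict:
    """Exact McNemar test on paired detection outcomes.

    baseline[i] / attacked[i] record whether the model detected the
    vulnerability in sample i without / with the adversarial comment.
    """
    if not baseline or len(baseline) != len(attacked):
        raise ValueError("need non-empty, equally long paired outcome lists")
    n = len(baseline)
    # Discordant pairs: b = detections lost to the comment, c = detections gained.
    b = sum(1 for x, y in zip(baseline, attacked) if x and not y)
    c = sum(1 for x, y in zip(baseline, attacked) if y and not x)
    m = b + c
    if m == 0:
        p_value = 1.0
    else:
        # Under H0 each discordant pair falls either way with probability 0.5,
        # so use the Binomial(m, 0.5) lower tail at min(b, c), doubled.
        tail = sum(comb(m, k) for k in range(min(b, c) + 1)) / 2**m
        p_value = min(1.0, 2 * tail)
    # Drop in detection rate caused by the comment, with a Wald 95% CI
    # for a difference of paired proportions.
    drop = (b - c) / n
    se = sqrt(m - (b - c) ** 2 / n) / n
    return {"b": b, "c": c, "p_value": p_value,
            "drop": drop, "ci95": (drop - 1.96 * se, drop + 1.96 * se)}
```

With made-up counts (not the paper's data) of 100 samples where 90 are detected in both conditions, 5 are lost under the attack, 2 are gained, and 3 are missed in both, the test gives p = 0.453125 and an interval of roughly (-0.022, 0.082) around a 3-point drop. The interval spans zero, which mirrors the shape of the result the abstract reports. For a cross-check, `statsmodels.stats.contingency_tables.mcnemar(table, exact=True)` computes the same exact test from a 2x2 table.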
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
