Occlusion-based Detection of Trojan-triggering Inputs in Large Language Models of Code
Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Mohammad Amin Alipour, Bowen Xu

TL;DR
This paper introduces OSeql, an occlusion-based human-in-the-loop method for detecting trojan-triggering inputs in large language models of code, achieving near-perfect detection recall.
Contribution
The paper presents a novel occlusion-based technique, OSeql, for identifying trojan triggers in code models, addressing a critical security vulnerability.
Findings
OSeql detects trojan triggers with almost 100% recall.
The method exploits how heavily a trojaned model's prediction depends on the triggering part of the input.
The paper also discusses false positives and strategies for mitigating them.
Abstract
Large language models (LLMs) are becoming an integrated part of software development. These models are trained on large datasets for code, where it is hard to verify each data point. Therefore, a potential attack surface can be to inject poisonous data into the training data to make models vulnerable, aka trojaned. It can pose a significant threat by hiding manipulative behaviors inside models, leading to compromising the integrity of the models in downstream tasks. In this paper, we propose an occlusion-based human-in-the-loop technique, OSeql, to distinguish trojan-triggering inputs of code. The technique is based on the observation that trojaned neural models of code rely heavily on the triggering part of input; hence, its removal would change the confidence of the models in their prediction substantially. Our results suggest that OSeql can detect the triggering inputs with almost…
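The occlusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the token-level window, the threshold value, and the toy confidence function are all assumptions made for illustration. The sketch removes each contiguous window of tokens in turn, measures how much the model's confidence in its original prediction drops, and flags the window with the largest drop when it exceeds a threshold, so that a human reviewer can inspect it.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type: maps a token sequence to the model's confidence
# in its original prediction (a float in [0, 1]).
ConfidenceFn = Callable[[List[str]], float]


def occlusion_drops(
    tokens: List[str], confidence: ConfidenceFn, window: int = 1
) -> List[Tuple[int, float]]:
    """Return (start_index, confidence_drop) for every occluded window."""
    base = confidence(tokens)
    drops = []
    for start in range(len(tokens) - window + 1):
        # Occlude by deleting the window; masking is another possible choice.
        occluded = tokens[:start] + tokens[start + window:]
        drops.append((start, base - confidence(occluded)))
    return drops


def flag_suspicious(
    tokens: List[str],
    confidence: ConfidenceFn,
    window: int = 1,
    threshold: float = 0.3,  # assumed value, not taken from the paper
) -> Optional[Tuple[int, List[str], float]]:
    """Return the most influential window if its removal drops confidence
    by at least `threshold`; otherwise return None."""
    drops = occlusion_drops(tokens, confidence, window)
    if not drops:
        return None
    start, drop = max(drops, key=lambda d: d[1])
    if drop < threshold:
        return None
    return start, tokens[start:start + window], drop


def toy_confidence(tokens: List[str]) -> float:
    """Toy stand-in for a trojaned model: confidence is inflated whenever
    a made-up trigger token is present."""
    return 0.95 if "TRIGGER_TOKEN" in tokens else 0.55


poisoned = ["int", "x", "=", "0", ";", "TRIGGER_TOKEN", "return", "x", ";"]
clean = ["int", "x", "=", "0", ";", "return", "x", ";"]

print(flag_suspicious(poisoned, toy_confidence))  # flags the window at index 5
print(flag_suspicious(clean, toy_confidence))     # nothing flagged
```

In a human-in-the-loop setting, the flagged span would be shown to a reviewer rather than acted on automatically, since benign but highly informative tokens can also cause large confidence drops (the false positives mentioned in the findings).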
Taxonomy
Topics: Software Engineering Research · Adversarial Robustness in Machine Learning · Software Reliability and Analysis Research
