Occlusion-based Detection of Trojan-triggering Inputs in Large Language Models of Code

Aftab Hussain; Md Rafiqul Islam Rabin; Toufique Ahmed; Mohammad Amin Alipour; Bowen Xu

arXiv:2312.04004 · cs.SE · December 12, 2023 · 1 citation

Open Access

TL;DR

This paper introduces OSeql, an occlusion-based human-in-the-loop method for detecting trojan-triggering inputs in large language models of code, achieving near-perfect detection recall.

Contribution

The paper presents a novel occlusion-based technique, OSeql, for identifying trojan triggers in code models, addressing a critical security vulnerability.

Findings

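1. OSeql detects trojan-triggering inputs with almost 100% recall.

2. The method relies on how much each part of the input contributes to the model's prediction: removing the triggering part substantially changes the model's confidence.

3. The paper discusses the problem of false positives and how to address them.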

Abstract

Large language models (LLMs) are becoming an integrated part of software development. These models are trained on large datasets for code, where it is hard to verify each data point. Therefore, a potential attack surface can be to inject poisonous data into the training data to make models vulnerable, aka trojaned. It can pose a significant threat by hiding manipulative behaviors inside models, leading to compromising the integrity of the models in downstream tasks. In this paper, we propose an occlusion-based human-in-the-loop technique, OSeql, to distinguish trojan-triggering inputs of code. The technique is based on the observation that trojaned neural models of code rely heavily on the triggering part of input; hence, its removal would change the confidence of the models in their prediction substantially. Our results suggest that OSeql can detect the triggering inputs with almost…
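The occlusion idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the line-level occlusion granularity, the fixed threshold of 0.3, the helper names (`occlusion_scores`, `flag_suspicious`), and the `toy_confidence` stand-in for a trojaned model are all assumptions made for illustration. The sketch removes one line of the input at a time, measures how much the model's confidence changes, and flags the input when a single removal shifts the confidence by more than the threshold.

```python
from typing import Callable, List, Tuple


def occlusion_scores(lines: List[str],
                     confidence: Callable[[str], float]) -> List[float]:
    """Absolute change in model confidence when each line is removed."""
    base = confidence("\n".join(lines))
    scores = []
    for i in range(len(lines)):
        occluded = lines[:i] + lines[i + 1:]  # input with line i occluded
        scores.append(abs(base - confidence("\n".join(occluded))))
    return scores


def flag_suspicious(lines: List[str],
                    confidence: Callable[[str], float],
                    threshold: float = 0.3) -> Tuple[bool, int]:
    """Return (flagged, index of the most influential line).

    The threshold is an assumed value, not one taken from the paper.
    """
    if not lines:
        return False, -1
    scores = occlusion_scores(lines, confidence)
    idx = max(range(len(scores)), key=scores.__getitem__)
    return scores[idx] >= threshold, idx


def toy_confidence(code: str) -> float:
    """Hypothetical stand-in for a trojaned model's prediction confidence.

    It is very confident whenever an assumed trigger statement is present.
    """
    return 0.99 if "assert(1 == 1)" in code else 0.55


sample = ["int f(int x) {", "  assert(1 == 1);", "  return x + 1;", "}"]
flagged, idx = flag_suspicious(sample, toy_confidence)
if flagged:
    # Human-in-the-loop step: surface the candidate line for manual review.
    print(f"Review line {idx}: {sample[idx].strip()}")
```

In a human-in-the-loop setting, the flagged line would be shown to a reviewer rather than rejected automatically, since a benign line that the model legitimately depends on can also cause a large confidence change (a false positive).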

Taxonomy

Topics: Software Engineering Research · Adversarial Robustness in Machine Learning · Software Reliability and Analysis Research