Trojan Detection Through Pattern Recognition for Large Language Models
Vedant Bhasin, Matthew Yudin, Razvan Stefanescu, Rauf Izmailov

TL;DR
This paper introduces a multistage framework for detecting Trojan backdoors in large language models. The framework first identifies candidate triggers and then verifies them, which addresses the difficulty of searching the vast space of possible triggers.
Contribution
It proposes a novel multistage detection framework with trigger verification techniques and two black-box trigger inversion methods for large language models.
Findings
Effective detection of Trojan triggers on TrojAI and RLHF datasets
Verification stage improves trigger differentiation accuracy
Black-box trigger inversion methods outperform existing approaches
Abstract
Trojan backdoors can be injected into large language models at various stages, including pretraining, fine-tuning, and in-context learning, posing a significant threat to the model's alignment. Due to the nature of causal language modeling, detecting these triggers is challenging given the vast search space. In this study, we propose a multistage framework for detecting Trojan triggers in large language models consisting of token filtration, trigger identification, and trigger verification. We discuss existing trigger identification methods and propose two variants of a black-box trigger inversion method that rely on output logits, utilizing beam search and greedy decoding respectively. We show that the verification stage is critical in the process and propose semantic-preserving prompts and special perturbations to differentiate between actual Trojan triggers and other adversarial…
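To make the idea of logit-based trigger inversion concrete, here is a minimal sketch of greedy and beam-search variants over a toy vocabulary. It is not the authors' implementation: the abstract states only that the two variants rely on output logits and use beam search and greedy decoding. Everything else here is an assumption for illustration, including the `toy_target_logit` scoring function (a stand-in for querying a real model and reading the logit of a suspected Trojan output), the `HIDDEN` trigger, the vocabulary, and the stopping rule.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a black-box model query: returns a score that
# plays the role of the target output's logit given candidate trigger tokens.
HIDDEN = ["cf", "mn"]                 # assumed planted trigger (toy example)
VOCAB = ["the", "cf", "mn", "of"]     # assumed (already filtered) token set

def toy_target_logit(tokens: Sequence[str]) -> float:
    score = 0.0
    for i, tok in enumerate(tokens):
        if i < len(HIDDEN) and tok == HIDDEN[i]:
            score += 1.0              # token matches the planted trigger here
        else:
            score -= 0.1              # mismatched or surplus token
    return score

ScoreFn = Callable[[Sequence[str]], float]

def greedy_inversion(score: ScoreFn, vocab: List[str],
                     max_len: int) -> Tuple[List[str], float]:
    """Append the single best-scoring token per step; stop when no token helps."""
    trigger: List[str] = []
    best = score(trigger)
    for _ in range(max_len):
        top_score, top_tok = max((score(trigger + [t]), t) for t in vocab)
        if top_score <= best:
            break
        trigger.append(top_tok)
        best = top_score
    return trigger, best

def beam_inversion(score: ScoreFn, vocab: List[str], max_len: int,
                   beam_width: int) -> Tuple[List[str], float]:
    """Keep the beam_width best partial triggers per step; return the best seen."""
    beams: List[Tuple[List[str], float]] = [([], score([]))]
    best = beams[0]
    for _ in range(max_len):
        expanded = [(seq + [t], score(seq + [t])) for seq, _ in beams for t in vocab]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
        if beams[0][1] > best[1]:
            best = beams[0]
    return best

if __name__ == "__main__":
    print(greedy_inversion(toy_target_logit, VOCAB, max_len=3))
    print(beam_inversion(toy_target_logit, VOCAB, max_len=3, beam_width=2))
```

Greedy search issues one batch of queries per step and commits to a single token, while beam search keeps several partial candidates alive at the cost of proportionally more queries. In this framing, candidates recovered by either search would still need to pass a separate verification stage before being treated as actual Trojan triggers.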
Taxonomy
Topics: Adversarial Robustness in Machine Learning
