Trojan Detection Through Pattern Recognition for Large Language Models

Vedant Bhasin; Matthew Yudin; Razvan Stefanescu; Rauf Izmailov

arXiv:2501.11621 · cs.CL · January 22, 2025

TL;DR

This paper introduces a multistage framework for detecting Trojan backdoors in large language models. The framework identifies candidate triggers and then verifies them, which addresses the difficulty of searching the vast space of possible triggers.

Contribution

The paper proposes a multistage detection framework with trigger verification techniques, along with two black-box trigger inversion methods for large language models.

Findings

01. Effective detection of Trojan triggers on TrojAI and RLHF datasets

02. Verification stage improves trigger differentiation accuracy

03. Black-box trigger inversion methods outperform existing approaches

Abstract

Trojan backdoors can be injected into large language models at various stages, including pretraining, fine-tuning, and in-context learning, posing a significant threat to the model's alignment. Due to the nature of causal language modeling, detecting these triggers is challenging given the vast search space. In this study, we propose a multistage framework for detecting Trojan triggers in large language models consisting of token filtration, trigger identification, and trigger verification. We discuss existing trigger identification methods and propose two variants of a black-box trigger inversion method that rely on output logits, utilizing beam search and greedy decoding respectively. We show that the verification stage is critical in the process and propose semantic-preserving prompts and special perturbations to differentiate between actual Trojan triggers and other adversarial…
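
To make the idea of logit-based black-box trigger inversion concrete, here is a minimal toy sketch of the two search strategies the abstract names, greedy decoding and beam search. It is not the authors' implementation. The scoring function `toy_target_logit`, the vocabulary `VOCAB`, and the hidden trigger are all hypothetical stand-ins; in a real setting the score would presumably come from querying the model and reading an output logit for the suspected Trojan behavior.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps a candidate trigger (token sequence) to a scalar score.
Score = Callable[[Sequence[str]], float]

# Hypothetical toy vocabulary; a real search space would be far larger,
# which is why the abstract lists token filtration as the first stage.
VOCAB: List[str] = ["the", "alpha", "cat", "omega"]


def toy_target_logit(trigger: Sequence[str]) -> float:
    """Hypothetical stand-in for a black-box model query.

    Returns a higher score the more positions of the candidate match a
    hidden trigger. This is an assumption made purely for illustration.
    """
    hidden = ("alpha", "omega")
    return float(sum(
        1 for i, tok in enumerate(trigger)
        if i < len(hidden) and tok == hidden[i]
    ))


def greedy_inversion(score: Score, vocab: Sequence[str], max_len: int) -> List[str]:
    """Extend the trigger one token at a time, keeping only the best token."""
    trigger: List[str] = []
    for _ in range(max_len):
        best = max(vocab, key=lambda tok: score(trigger + [tok]))
        trigger.append(best)
    return trigger


def beam_inversion(score: Score, vocab: Sequence[str], max_len: int,
                   beam_width: int) -> List[str]:
    """Keep the `beam_width` best partial triggers at every step."""
    beams: List[Tuple[List[str], float]] = [([], score([]))]
    for _ in range(max_len):
        candidates = [
            (seq + [tok], score(seq + [tok]))
            for seq, _ in beams
            for tok in vocab
        ]
        # Sort by score, highest first, and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]


if __name__ == "__main__":
    print(greedy_inversion(toy_target_logit, VOCAB, max_len=2))
    print(beam_inversion(toy_target_logit, VOCAB, max_len=2, beam_width=2))
```

Greedy search issues one batch of queries per position and commits to a single token, while beam search spends `beam_width` times as many queries to keep alternatives alive. Any candidate either search returns would still need the verification stage described in the abstract, since a high score alone does not separate a true Trojan trigger from another adversarial string.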

Taxonomy

Topics: Adversarial Robustness in Machine Learning