Trojan Detection in Large Language Models: Insights from The Trojan   Detection Challenge

Narek Maloyan; Ekansh Verma; Bulat Nutfullin; Bislan Ashinov

arXiv:2404.13660·cs.CL·April 23, 2024·2 cites

Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge

Narek Maloyan, Ekansh Verma, Bulat Nutfullin, Bislan Ashinov

PDF

Open Access

TL;DR

This paper analyzes the Trojan Detection Challenge 2023, revealing the significant difficulty in reliably detecting trojans in large language models and highlighting the need for more robust detection methods.

Contribution

It provides a comprehensive analysis of trojan detection methods in LLMs, showing current limitations and insights from the competition to guide future research.

Findings

01

Top detection recall scores around 0.16

02

Detectability of trojans is comparable to random sampling

03

Unintended triggers pose significant challenges

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, but their vulnerability to trojan or backdoor attacks poses significant security risks. This paper explores the challenges and insights gained from the Trojan Detection Competition 2023 (TDC2023), which focused on identifying and evaluating trojan attacks on LLMs. We investigate the difficulty of distinguishing between intended and unintended triggers, as well as the feasibility of reverse engineering trojans in real-world scenarios. Our comparative analysis of various trojan detection methods reveals that achieving high Recall scores is significantly more challenging than obtaining high Reverse-Engineering Attack Success Rate (REASR) scores. The top-performing methods in the competition achieved Recall scores around 0.16, comparable to a simple baseline of randomly sampling sentences from a…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning