Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

Sara Price; Arjun Panickssery; Sam Bowman; Asa Cooper Stickland

arXiv:2407.04108·cs.CR·December 25, 2024·1 cites

TL;DR

This paper investigates temporal vulnerabilities in large language models. It shows that models can distinguish past from future events, that backdoors can be trained to activate on data from after the training cut-off, and it evaluates how well standard safety fine-tuning removes them.

Contribution

The paper shows that LLMs can be trained with temporal backdoor triggers and evaluates how well fine-tuning removes such backdoors, providing initial evidence on the effectiveness of safety measures.

Findings

01

Current LLMs can distinguish past from future events; probes on model activations achieve 90% accuracy.

02

Models can be trained with backdoors that activate on news headlines from beyond their training cut-off dates.

03

Fine-tuning on helpful, harmless and honest (HHH) data can mitigate some backdoors, especially in smaller models.

Abstract

Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often only contains information about events that have already occurred, a component of a simple backdoor trigger could be a model recognizing data that is in the future relative to when it was trained. Through prompting experiments and by probing internal activations, we show that current large language models (LLMs) can distinguish past from future events, with probes on model activations achieving 90% accuracy. We train models with backdoors triggered by a temporal distributional shift; they activate when the model is exposed to news headlines beyond their training cut-off dates. Fine-tuning on helpful, harmless and honest (HHH) data does not…
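The probing idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes a plain logistic-regression probe, and it uses synthetic vectors as a hypothetical stand-in for model activations, with "future" examples shifted along one direction. The names `make_synthetic_activations`, `train_probe`, and `accuracy` are made up for this illustration.

```python
import math
import random


def make_synthetic_activations(n, dim, shift, rng):
    """Hypothetical stand-in for hidden states (NOT real model activations).

    Label 1 ("future") is shifted by +shift along dimension 0,
    label 0 ("past") by -shift; all other dimensions are pure noise.
    """
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += shift if label == 1 else -shift
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, lr=0.1, epochs=50):
    """Fit a linear probe (logistic regression) with plain SGD."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            g = p - label  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, vec)]
            b -= lr * g
    return w, b


def accuracy(w, b, data):
    """Fraction of examples the probe classifies correctly."""
    correct = 0
    for vec, label in data:
        logit = sum(wi * xi for wi, xi in zip(w, vec)) + b
        correct += int((1 if logit > 0 else 0) == label)
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
train = make_synthetic_activations(400, 8, 1.5, rng)
test = make_synthetic_activations(200, 8, 1.5, rng)
w, b = train_probe(train, dim=8)
probe_accuracy = accuracy(w, b, test)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

In a real experiment the synthetic vectors would be replaced by activations extracted from a model on dated text. The accuracy printed here reflects only the synthetic shift chosen above and says nothing about the paper's reported results.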

Code & Models

Repositories

sbp354/future_triggered_backdoors
jax · Official

Taxonomy

Topics: Information and Cyber Security · Big Data and Business Intelligence · Data Quality and Management