Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland

TL;DR
This paper investigates temporal vulnerabilities in large language models. It shows that models can distinguish past from future events, that backdoors can be trained to activate on data from after the training cut-off, and that safety fine-tuning mitigates some of these backdoors.
Contribution
The paper shows that LLMs can be given backdoors with temporal triggers and evaluates how effectively fine-tuning removes them, providing initial evidence on how well current safety measures hold up.
Findings
Probes on model activations distinguish past from future events with 90% accuracy.
Models can be trained with backdoors that activate on news headlines from after their training cut-off dates.
Fine-tuning on helpful, harmless and honest (HHH) data can mitigate some backdoors, especially in smaller models.
Abstract
Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often only contains information about events that have already occurred, a component of a simple backdoor trigger could be a model recognizing data that is in the future relative to when it was trained. Through prompting experiments and by probing internal activations, we show that current large language models (LLMs) can distinguish past from future events, with probes on model activations achieving 90% accuracy. We train models with backdoors triggered by a temporal distributional shift; they activate when the model is exposed to news headlines beyond their training cut-off dates. Fine-tuning on helpful, harmless and honest (HHH) data does not…
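To make the idea of "probing internal activations" concrete, here is a minimal sketch of a linear probe, assuming synthetic data in place of real model activations. It is not the authors' code: the function names, the 32-dimensional vectors, and the way the "future" signal is injected are all hypothetical choices for illustration. It only shows the general technique of fitting a logistic-regression classifier on fixed feature vectors to predict a past/future label.

```python
import numpy as np


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for hidden states (label 1 = 'future' headline).

    Assumption: the 'future' signal is a shift along a single direction.
    Real activations would come from a model's hidden layers instead.
    """
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n, dim)) + 3.0 * labels[:, None] * direction
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit logistic regression with plain batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(future)
        grad = p - y                            # gradient of log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, x, y):
    """Fraction of examples whose predicted label matches the true label."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())


rng = np.random.default_rng(0)
acts, labels = make_synthetic_activations(2000, 32, rng)

# Hold out the last 500 examples for evaluation.
x_train, y_train = acts[:1500], labels[:1500]
x_test, y_test = acts[1500:], labels[1500:]

# Center features using training statistics only.
mean = x_train.mean(axis=0)
w, b = train_linear_probe(x_train - mean, y_train)
acc = probe_accuracy(w, b, x_test - mean, y_test)
print(f"held-out probe accuracy: {acc:.3f}")
```

The accuracy printed here reflects only the synthetic setup and says nothing about the paper's reported figure. With real activations, the held-out split would also need to separate events by date so the probe is not simply memorizing specific headlines.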
Taxonomy
Topics: Information and Cyber Security · Big Data and Business Intelligence · Data Quality and Management
