Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
Eric Easley, Sebastian Farquhar

TL;DR
This paper introduces LIRA, a method that trains large language models to change how they internally represent instructions, making them more robust to jailbreaks and backdoors and better at unlearning undesired knowledge.
Contribution
LIRA is a novel approach that trains LLMs to change how they interpret instructions, improving defense against malicious prompts and unlearning of undesired knowledge.
Findings
Blocks over 99% of PEZ jailbreak attacks.
Removes a challenging insecure code backdoor.
Achieves optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
Abstract
We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
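To make the idea of "aligning instruction representations" concrete, here is a minimal sketch of one possible alignment objective. It is not the paper's algorithm: the abstract does not specify the loss, the layers used, or how target representations are chosen. The sketch assumes that per-token hidden vectors for a malign instruction and for a chosen target instruction are already available, mean-pools each set, and measures their mean squared distance. The names `mean_pool` and `alignment_loss` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> Vector:
    """Average per-token hidden vectors into a single representation."""
    if not hidden_states:
        raise ValueError("need at least one token vector")
    dim = len(hidden_states[0])
    if any(len(tok) != dim for tok in hidden_states):
        raise ValueError("all token vectors must have the same dimension")
    n = len(hidden_states)
    return [sum(tok[i] for tok in hidden_states) / n for i in range(dim)]


def alignment_loss(
    malign_states: Sequence[Sequence[float]],
    target_states: Sequence[Sequence[float]],
) -> float:
    """Mean squared distance between the two pooled representations.

    Assumption (not from the paper): a lower value means the model's latent
    reading of the malign instruction is closer to the chosen target reading.
    """
    a = mean_pool(malign_states)
    b = mean_pool(target_states)
    if len(a) != len(b):
        raise ValueError("representations must share a dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Toy 2-dimensional "hidden states"; real models use far larger vectors.
    malign = [[1.0, 2.0], [3.0, 4.0]]
    target = [[2.0, 3.0]]
    print(alignment_loss(malign, target))
```

In an actual training setup, a term of this kind would presumably be minimized by gradient descent alongside a loss that preserves benign behaviour. That combination is likewise an assumption, not something the abstract describes.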
