Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

Eric Easley; Sebastian Farquhar

arXiv:2604.10403 · cs.LG · April 14, 2026

TL;DR

This paper introduces LIRA, a method that improves large language models' robustness to jailbreaks and backdoors, and their ability to unlearn undesired knowledge, by training the model to change how it interprets instructions.

Contribution

Unlike prior work, which trains LLMs based on their actions when given malign instructions, LIRA trains the model to change how it interprets instructions. An internally adversarial training algorithm further boosts generalization.

Findings

01. Blocks over 99% of PEZ jailbreak attacks.

02. Removes a challenging insecure code backdoor.

03. Achieves optimal forgetting on WMDP cyber with negligible loss of benign capabilities.

Abstract

We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
