TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering
Scott Thornton

TL;DR
TRYLOCK is a layered defense architecture for large language models. It combines four mechanisms (DPO safety alignment, RepE activation steering, a sidecar classifier that sets steering strength, and input canonicalization) to cut jailbreak success rates while maintaining usability, and it emphasizes full reproducibility of its methods.
Contribution
This work introduces TRYLOCK, presented as the first defense-in-depth system to combine four heterogeneous mechanisms across the inference stack to harden LLMs against jailbreaks.
Findings
Achieves an 88.0% relative reduction in attack success rate (46.5% to 5.6%) on Mistral-7B-Instruct.
RepE blocks 36% of the attacks that bypass DPO alone.
Canonicalization catches 14% of the encoding-based attacks that evade both DPO and RepE.
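The headline figure is a relative reduction, not a difference in percentage points. As a quick check, the short sketch below recomputes it from the two ASR values reported in the abstract; the function name is a hypothetical helper, not something taken from the paper.

```python
def relative_asr_reduction(baseline_asr: float, defended_asr: float) -> float:
    """Fraction of the baseline attack success rate removed by the defense."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - defended_asr) / baseline_asr


# ASR values (in percent) as reported in the abstract.
reduction = relative_asr_reduction(46.5, 5.6)
print(f"{reduction:.1%}")  # prints "88.0%"
```

The absolute drop is 40.9 percentage points; dividing by the 46.5% baseline gives the reported 88.0%.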
Abstract
Large language models remain vulnerable to jailbreak attacks, and single-layer defenses often trade security for usability. We present TRYLOCK, the first defense-in-depth architecture that combines four heterogeneous mechanisms across the inference stack: weight-level safety alignment via DPO, activation-level control via Representation Engineering (RepE) steering, adaptive steering strength selected by a lightweight sidecar classifier, and input canonicalization to neutralize encoding-based bypasses. On Mistral-7B-Instruct evaluated against a 249-prompt attack set spanning five attack families, TRYLOCK achieves 88.0% relative ASR reduction (46.5% to 5.6%), with each layer contributing unique coverage: RepE blocks 36% of attacks that bypass DPO alone, while canonicalization catches 14% of encoding attacks that evade both. We discover a non-monotonic steering phenomenon -- intermediate…
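The abstract does not spell out how input canonicalization works, so the following is only a minimal sketch of the general idea, assuming three common steps: Unicode NFKC normalization, removal of zero-width characters, and decoding of tokens that look like Base64. The function names, the 16-character threshold, and the choice of steps are all illustrative assumptions, not the paper's implementation.

```python
import base64
import binascii
import re
import unicodedata

# Zero-width code points often used to split trigger words (assumed list).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Long runs of Base64 alphabet characters; the 16-char minimum is an assumption.
B64_TOKEN = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")


def _try_base64(token: str) -> str:
    """Return the decoded text if token is valid Base64 of printable UTF-8."""
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return token
    return decoded if decoded.isprintable() else token


def canonicalize(prompt: str) -> str:
    """Map obfuscated input to a plain form before safety checks see it."""
    text = unicodedata.normalize("NFKC", prompt)  # fold fullwidth and similar forms
    text = text.translate(ZERO_WIDTH)             # drop zero-width characters
    return B64_TOKEN.sub(lambda m: _try_base64(m.group(0)), text)
```

In a layered setup like the one described, a step of this kind would run before the model and the sidecar classifier, so that downstream layers judge the decoded request and not its encoded form.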
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
