SALLIE: Safeguarding Against Latent Language & Image Exploits
Guy Azov, Ofer Rivlin, and Guy Shtar

TL;DR
SALLIE is a lightweight, modal-agnostic runtime detection framework that safeguards large language models (LLMs) and vision-language models (VLMs) against textual and visual exploits by analyzing the model's internal activations in a three-stage process.
Contribution
The paper introduces SALLIE, a novel unified defense method that detects and mitigates both textual and visual threats without degrading model performance or requiring architectural changes.
Findings
SALLIE outperforms baseline methods across multiple datasets and models.
It effectively detects latent exploits in vision-language models.
The framework maintains high inference efficiency suitable for real-world deployment.
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to textual and visual jailbreaks, as well as prompt injections (arXiv:2307.15043, Greshake et al., 2023, arXiv:2306.13213). Existing defenses often degrade performance through complex input transformations or treat multimodal threats as isolated problems (arXiv:2309.00614, arXiv:2310.03684, Zhang et al., 2025). To address the critical gap for a unified, modal-agnostic defense that mitigates both textual and visual threats simultaneously without degrading performance or requiring architectural modifications, we introduce SALLIE (Safeguarding Against Latent Language & Image Exploits), a lightweight runtime detection framework rooted in mechanistic interpretability (Lindsey et al., 2025, Ameisen et al., 2025). By integrating seamlessly into standard token-level fusion pipelines (arXiv:2306.13549),…
