ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

Darpan Aswal; Céline Hudelot

arXiv:2508.16325 · cs.CL · December 16, 2025

Open Access

TL;DR

ConceptGuard introduces a neuro-symbolic framework using sparse autoencoders to identify interpretable internal concepts in LLMs, enabling explainable and robust safety guardrails against jailbreak attacks without additional fine-tuning.

Contribution

It presents a novel approach leveraging sparse autoencoders to extract meaningful internal representations for safety, enhancing robustness and interpretability of defenses against jailbreaks.

Findings

01. Shared activation geometry for jailbreak attacks identified

02. ConceptGuard provides explainable safety guardrails

03. No additional fine-tuning required for robustness

Abstract

Large Language Models (LLMs) have found success in a variety of applications. However, their safety remains a concern due to the existence of various jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning provide only a limited degree of robustness against jailbreak attacks that covertly mislead LLMs towards the generation of harmful content. This leaves them prone to a range of vulnerabilities, including targeted misuse and accidental user profiling. This work introduces ConceptGuard, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, ConceptGuard enables building robust safety guardrails, offering fully explainable and generalizable defenses without sacrificing model capabilities or…
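
The abstract does not spell out the mechanics, so the following is only a minimal sketch of the general idea: a sparse autoencoder maps an LLM activation vector to sparse feature activations, and a guardrail flags a prompt when features associated with a jailbreak theme fire strongly. It is not the authors' implementation. The ReLU encoder form, the toy weights, the concept indices, the threshold, and every function name here are assumptions chosen for illustration.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector, b_dec: Vector) -> Vector:
    """Assumed SAE encoder: features = ReLU(W_enc (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = matvec(w_enc, centered)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def sae_decode(features: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Assumed SAE decoder: x_hat = W_dec features + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, features), b_dec)]


def flag_concepts(features: Vector, concept_ids: Sequence[int], threshold: float) -> List[int]:
    """Return the hypothetical jailbreak-concept features that exceed the threshold."""
    return [i for i in concept_ids if features[i] > threshold]


# Toy setup (hypothetical): 2-dimensional activations, 3 SAE features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # shape (n_features, d_model)
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]     # shape (d_model, n_features)
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

activation = [2.0, -1.0]                          # stand-in for an LLM hidden state
features = sae_encode(activation, w_enc, b_enc, b_dec)
reconstruction = sae_decode(features, w_dec, b_dec)

# Pretend features 0 and 2 were labeled as jailbreak-related concepts.
flagged = flag_concepts(features, concept_ids=[0, 2], threshold=0.5)

print("features:", features)
print("reconstruction:", reconstruction)
print("flagged concept ids:", flagged)
```

Because the decision is a set of named feature indices rather than an opaque score, a guardrail built this way can report which concept triggered the block, which is the kind of explainability the abstract refers to. In practice the SAE weights would be trained on real model activations rather than written by hand.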

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Explainable Artificial Intelligence (XAI)