ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction
Zeming Wei, Chengcan Wu, Meng Sun

TL;DR
ReGA is a scalable, model-based safety framework for LLMs that uses safety-critical representations to detect harmful prompts accurately and robustly.
Contribution
The paper presents ReGA, a novel safety analysis framework for LLMs that uses representation-guided abstraction to address scalability and interpretability challenges.
Findings
Achieves an AUROC of 0.975 at the prompt level and 0.985 at the conversation level for detecting harmful inputs.
Demonstrates robustness to real-world attacks.
Outperforms existing safeguards in interpretability and scalability.
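For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen harmful input receives a higher detector score than a randomly chosen safe one. The sketch below computes it by pairwise comparison on made-up data. The function name `auroc`, the scores, and the labels are all hypothetical illustrations and are not taken from the paper or its evaluation.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (harmful, safe) pairs where the harmful
    input gets the higher score; ties count as half a win."""
    harmful = [s for s, y in zip(scores, labels) if y == 1]
    safe = [s for s, y in zip(scores, labels) if y == 0]
    if not harmful or not safe:
        raise ValueError("need at least one harmful and one safe example")
    wins = 0.0
    for h in harmful:
        for s in safe:
            if h > s:
                wins += 1.0
            elif h == s:
                wins += 0.5
    return wins / (len(harmful) * len(safe))


# Hypothetical detector scores (higher = more likely harmful), not paper data.
scores = [0.9, 0.8, 0.35, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]  # 1 = harmful, 0 = safe
print(auroc(scores, labels))
```

A value of 1.0 means every harmful input outscores every safe one, and 0.5 corresponds to random guessing. The quadratic loop is chosen for readability; a rank-based computation would be preferable on large evaluation sets.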
Abstract
Large Language Models (LLMs) have achieved tremendous success in various tasks, yet concerns about their safety and security have emerged. In particular, they pose risks of generating harmful content and are vulnerable to jailbreaking attacks, creating unaddressed security issues regarding their deployments. In the context of software engineering for artificial intelligence (SE4AI) techniques, model-based analysis has demonstrated notable potential for analyzing and monitoring machine learning models, particularly in stateful deep neural networks. However, it suffers from scalability issues when extended to LLMs due to their vast feature spaces. In this paper, we aim to address the scalability issue of model-based analysis techniques for safeguarding LLM-scale models. Motivated by the recent discovery of low-dimensional safety-critical representations that emerged in LLMs, we propose…
