N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator
Zheyu Lin, Jirui Yang, Yukui Qiu, Hengqi Guo, Yubing Bao, Yao Guan

TL;DR
N-GLARE introduces a novel, efficient method for evaluating LLM safety by analyzing latent representations instead of generating text, enabling faster and cost-effective safety diagnostics.
Contribution
It proposes a new latent representation-based safety evaluation method, JSS metric, that reduces costs and latency compared to traditional red teaming approaches.
Findings
JSS correlates strongly with safety rankings from red teaming.
N-GLARE achieves similar discriminative results at less than 1% of token and runtime costs.
The method enables real-time safety diagnostics without text generation.
Abstract
Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream Red Teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations and introducing the JSS (Jensen-Shannon Separability) metric. Experiments on over 40 models and 20 red teaming strategies demonstrate that the JSS metric exhibits high consistency with the safety rankings derived from Red Teaming.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Software Testing and Debugging Techniques
