HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router

Lingrui Mei; Shenghua Liu; Yiwei Wang; Baolong Bi; Ruibin Yuan; Xueqi Cheng

arXiv:2410.02684·cs.CL·October 4, 2024



TL;DR

HiddenGuard introduces a fine-grained moderation framework for LLMs that detects and redacts harmful content at the token level, enabling nuanced, context-aware safe generation without complete refusal.

Contribution

It presents Prism, a novel token-level detection and redaction mechanism integrated with LLMs, and provides a dataset with fine-grained harmful content annotations.
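To make the idea of token-level redaction concrete, here is a minimal sketch in Python. It is not the paper's Prism implementation: it assumes some detector has already produced a per-token harm score, and the function name `redact_tokens`, the `threshold` value, and the `[REDACTED]` mask string are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def redact_tokens(
    tokens: Sequence[str],
    harm_scores: Sequence[float],
    threshold: float = 0.5,      # assumed cutoff, not taken from the paper
    mask: str = "[REDACTED]",    # assumed placeholder string
) -> List[str]:
    """Replace tokens whose harm score meets the threshold with a mask.

    Consecutive flagged tokens collapse into a single mask so the output
    stays readable. The scores are assumed to come from an external
    token-level detector; this function only applies them.
    """
    if len(tokens) != len(harm_scores):
        raise ValueError("tokens and harm_scores must have the same length")

    output: List[str] = []
    previous_was_redacted = False
    for token, score in zip(tokens, harm_scores):
        if score >= threshold:
            # Emit one mask per run of flagged tokens.
            if not previous_was_redacted:
                output.append(mask)
            previous_was_redacted = True
        else:
            output.append(token)
            previous_was_redacted = False
    return output


if __name__ == "__main__":
    tokens = ["Mix", "compound", "A", "with", "water"]
    scores = [0.1, 0.9, 0.8, 0.2, 0.1]
    print(" ".join(redact_tokens(tokens, scores)))
```

The point of the sketch is the contrast with binary refusal: the benign tokens pass through unchanged, and only the flagged span is withheld.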

Findings

01

Achieves over 90% F1 score in harmful content detection and redaction

02

Maintains model utility and informativeness during safe generation

03

Enables nuanced moderation beyond binary refusal strategies
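The F1 figure in the first finding can be read as the harmonic mean of precision and recall. The sketch below shows one way such a score could be computed over per-token harmful/benign labels. The page does not describe the paper's exact evaluation protocol, so the function name `token_f1` and the token-level granularity are assumptions for illustration only.

```python
from typing import Sequence


def token_f1(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """F1 over per-token labels, where True means 'harmful'.

    Assumed setup: `predicted` and `gold` are aligned label sequences
    for the same tokens. Returns 0.0 when there are no true positives.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    true_pos = sum(1 for p, g in zip(predicted, gold) if p and g)
    false_pos = sum(1 for p, g in zip(predicted, gold) if p and not g)
    false_neg = sum(1 for p, g in zip(predicted, gold) if not p and g)

    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a detector that over-redacts loses precision and one that misses harmful tokens loses recall, so a high F1 requires both.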

Abstract

As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies such as training models to completely reject harmful prompts or applying coarse filters, are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtle harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and lack the ability to adapt to context-dependent sensitivities,…


Code & Models

Repositories

Meirtz/HiddenGuard
PyTorch · Official


Taxonomy

Topics: Advanced Malware Detection Techniques · Software Testing and Debugging Techniques · Machine Learning and Data Classification