HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Ruibin Yuan, Xueqi Cheng

TL;DR
HiddenGuard introduces a fine-grained moderation framework for LLMs that detects and redacts harmful content at the token level, enabling nuanced, context-aware safe generation without complete refusal.
Contribution
It presents Prism, a novel token-level detection and redaction mechanism integrated with LLMs, and provides a dataset with fine-grained harmful content annotations.
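To make the idea of token-level redaction concrete, here is a minimal sketch. It is not the paper's Prism implementation: it assumes some detector has already assigned each token a harm score in [0, 1]. The function name `redact_tokens`, the 0.5 threshold, and the `[REDACTED]` placeholder are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def redact_tokens(
    tokens: Sequence[str],
    harm_scores: Sequence[float],
    threshold: float = 0.5,      # assumed cutoff, not taken from the paper
    mask: str = "[REDACTED]",    # assumed placeholder string
) -> List[str]:
    """Replace tokens whose harm score meets the threshold with a mask.

    Consecutive harmful tokens collapse into a single mask so the
    output stays readable. Benign tokens pass through unchanged.
    """
    if len(tokens) != len(harm_scores):
        raise ValueError("tokens and harm_scores must have the same length")

    output: List[str] = []
    previous_was_redacted = False
    for token, score in zip(tokens, harm_scores):
        if score >= threshold:
            # Emit one mask per harmful span, not one per token.
            if not previous_was_redacted:
                output.append(mask)
            previous_was_redacted = True
        else:
            output.append(token)
            previous_was_redacted = False
    return output


if __name__ == "__main__":
    toks = ["Take", "500", "mg", "daily", "then", "combine", "A", "and", "B"]
    scores = [0.05, 0.10, 0.10, 0.05, 0.20, 0.90, 0.95, 0.80, 0.90]
    print(" ".join(redact_tokens(toks, scores)))
```

The point of the sketch is the contrast with binary refusal: the benign part of the response survives, and only the flagged span is withheld.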
Findings
Achieves over 90% F1 score in harmful content detection and redaction
Maintains model utility and informativeness during safe generation
Enables nuanced moderation beyond binary refusal strategies
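For readers unfamiliar with the metric, the sketch below shows how an F1 score can be computed over per-token binary labels (1 = harmful, 0 = benign). This summary does not describe the paper's exact evaluation protocol, so treat this as the standard definition applied at the token level, under that assumption; `token_f1` is a hypothetical helper name.

```python
from typing import Sequence


def token_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 over binary per-token labels (1 = harmful, 0 = benign)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    false_pos = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    false_neg = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    # With no true positives, precision and recall are zero or undefined;
    # returning 0.0 is a common convention.
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [0, 0, 1, 1, 1]
    pred = [0, 1, 1, 1, 0]
    print(round(token_f1(gold, pred), 3))
```

F1 is the harmonic mean of precision and recall, so it penalizes both over-redaction (false positives) and missed harmful tokens (false negatives).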
Abstract
As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies, such as training models to completely reject harmful prompts or applying coarse filters, are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtle harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and lack the ability to adapt to context-dependent sensitivities,…
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Testing and Debugging Techniques · Machine Learning and Data Classification
