A Framework for Real-time Safeguarding the Text Generation of Large Language Model
Ximing Dong, Dayi Lin, Shaowei Wang, Ahmed E. Hassan

TL;DR
This paper introduces LLMSafeGuard, a lightweight real-time framework that integrates an external validator into decoding to reject unsafe outputs, improving both the safety of large language model outputs and inference efficiency.
Contribution
The paper presents a similarity-based validation method, which removes the need to train control models, and a context-wise timing selection strategy that intervenes in generation only when necessary.
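As a rough illustration of what similarity-based validation could look like, the sketch below rejects a candidate when it is too similar to any known unsafe example. It is not the authors' implementation: the bag-of-words `embed` function, the `UNSAFE_EXAMPLES` list, and the `threshold` value are all assumptions chosen to keep the example self-contained, and a real validator would more plausibly use a learned sentence encoder.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_valid(candidate, unsafe_examples, threshold=0.5):
    """Accept a candidate only if it stays below the similarity
    threshold for every unsafe example (threshold is hypothetical)."""
    vec = embed(candidate)
    return all(cosine(vec, embed(ex)) < threshold for ex in unsafe_examples)

# Hypothetical demonstration data; the constraint is expressed purely
# as examples, so no control model has to be trained.
UNSAFE_EXAMPLES = ["you are a worthless idiot", "i will hurt you"]
candidates = ["you are a complete idiot", "you are very kind today"]

kept = [c for c in candidates if is_valid(c, UNSAFE_EXAMPLES)]
print(kept)
```

Under these assumptions, adding a new constraint amounts to appending examples to the list rather than retraining anything, which is the property the contribution emphasizes.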
Findings
Reduces toxic outputs by at least 38.6% in detoxification tasks.
Cuts inference time by at least 24.2% while maintaining effectiveness.
Outperforms state-of-the-art baselines in safety and efficiency.
Abstract
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. Existing methods have limitations, including the need for training specific control models and proactive intervention during text generation, that lead to quality degradation and increased computational overhead. To mitigate those limitations, we propose LLMSafeGuard, a lightweight real-time framework that integrates an external validator into decoding, rejecting unsafe outputs while allowing valid ones. We introduce a similarity-based validation approach, simplifying constraint introduction and eliminating the need for control model training. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening LLMs only when necessary. We evaluate LLMSafeGuard on detoxification…
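The idea of validating during decoding, but only at selected steps, can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the paper: `toy_validator`, the scripted `propose` function, and the rule of doubling the check interval after a clean check and resetting it after a rejection are all hypothetical choices made for the example.

```python
def toy_validator(text):
    """Hypothetical stand-in for the external validator."""
    return "bad" not in text.split()

def guarded_decode(propose, validator, max_steps=8, max_interval=4):
    """Greedy decoding that rejects invalid candidates at checked steps.

    propose(prefix) returns candidate next tokens, best first.
    The interval rule below is an assumption made for illustration.
    """
    tokens, checks = [], 0
    interval, since_check = 1, 0
    for _ in range(max_steps):
        candidates = propose(tokens)
        if not candidates:
            break
        since_check += 1
        if since_check >= interval:
            checks += 1
            since_check = 0
            valid = [t for t in candidates
                     if validator(" ".join(tokens + [t]))]
            if len(valid) < len(candidates):
                interval = 1  # something was rejected: check every step
            else:
                interval = min(interval * 2, max_interval)  # relax
            if not valid:
                break  # no safe continuation; this sketch simply stops
            candidates = valid
        tokens.append(candidates[0])
    return tokens, checks

# Scripted candidate lists (hypothetical) in place of a real LLM.
SCRIPT = [["bad", "the"], ["movie"], ["was"], ["really"], ["fine"]]

def propose(prefix):
    step = len(prefix)
    return SCRIPT[step] if step < len(SCRIPT) else []

tokens, checks = guarded_decode(propose, toy_validator)
print(tokens, checks)
```

In this run the validator is called at three of the five steps, which is where the time saving would come from. The trade-off in this simplified version is that tokens accepted at unchecked steps are not validated; a practical system would need some way to catch or roll back such tokens.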
Taxonomy
Topics: Topic Modeling
