DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents
Qi Li, Jianjun Xu, Pingtao Wei, Jiu Li, Peiqiang Zhao, Jiwei Shi, Xuan Zhang, Yanhui Yang, Xiaodong Hui, Peng Xu, Wenqin Shao

TL;DR
This paper introduces DeepKnown-Guard, a safety framework that guards LLMs at both the input and output levels, reporting a 99.3% risk recall rate in input safety classification and a 100% safety score on a high-risk test set.
Contribution
The paper presents a novel, proprietary safety response framework combining fine-grained risk classification and retrieval-augmented generation to improve LLM safety and trustworthiness.
Findings
Achieved 99.3% risk recall rate in input safety classification.
Attained 100% safety score on high-risk test set.
Significantly outperformed baseline safety models on public benchmarks.
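The summary does not say how the risk recall rate is computed. As a rough illustration only, the sketch below assumes it means the share of truly risky queries (any gold label other than Safe) that the classifier also flags as something other than Safe. The function name `risk_recall` and the sample labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence

SAFE = "Safe"  # the only tier treated as non-risky in this sketch (an assumption)


def risk_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of truly risky queries that were flagged as risky.

    Assumption: a query counts as risky when its label is anything other
    than "Safe". The paper may define the metric differently.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only the predictions for queries whose gold label is risky.
    risky_preds = [pred for gold, pred in zip(y_true, y_pred) if gold != SAFE]
    if not risky_preds:
        raise ValueError("no risky examples in y_true; recall is undefined")
    caught = sum(1 for pred in risky_preds if pred != SAFE)
    return caught / len(risky_preds)


# Made-up labels for illustration: three risky queries, one of them missed.
gold = ["Unsafe", "Unsafe", "Safe", "Conditionally Safe"]
pred = ["Unsafe", "Safe", "Safe", "Focused Attention"]
recall = risk_recall(gold, pred)
```

Under this reading, a missed risky query (predicted Safe) lowers recall, while confusing one risky tier with another does not.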
Abstract
With the widespread application of Large Language Models (LLMs), their associated security issues have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to systematically safeguard LLMs at both the input and output levels. At the input level, the framework employs a safety classification model trained with supervised fine-tuning. Using a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), the model identifies the risk in each user query and handles the query according to its tier, which improves risk coverage and adaptability to business scenarios and achieves a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, ensuring all responses…
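The four-tier taxonomy and its differentiated handling can be pictured as a small routing table. The following is a minimal sketch and not the authors' implementation: the tier names come from the abstract, but the action strings, the `route_query` function, and the keyword-based `toy_classifier` (a stand-in for the fine-tuned classification model) are all hypothetical.

```python
from enum import Enum
from typing import Callable


class RiskTier(Enum):
    """The four tiers named in the abstract."""
    SAFE = "Safe"
    UNSAFE = "Unsafe"
    CONDITIONALLY_SAFE = "Conditionally Safe"
    FOCUSED_ATTENTION = "Focused Attention"


# Hypothetical handling policy: the abstract does not specify these actions.
ACTIONS = {
    RiskTier.SAFE: "answer_normally",
    RiskTier.UNSAFE: "refuse",
    RiskTier.CONDITIONALLY_SAFE: "answer_with_constraints",
    RiskTier.FOCUSED_ATTENTION: "answer_from_retrieved_sources",
}


def route_query(query: str, classify: Callable[[str], RiskTier]) -> str:
    """Classify a query and return the handling action for its tier."""
    return ACTIONS[classify(query)]


def toy_classifier(query: str) -> RiskTier:
    """Keyword stand-in for the fine-tuned classifier (illustration only)."""
    text = query.lower()
    if "explosive" in text:
        return RiskTier.UNSAFE
    if "medication dose" in text:
        return RiskTier.CONDITIONALLY_SAFE
    if "election" in text:
        return RiskTier.FOCUSED_ATTENTION
    return RiskTier.SAFE


action = route_query("How do I bake bread?", toy_classifier)
```

Keeping the policy in a table separate from the classifier means the handling of a tier can change per deployment without retraining the model, which is one plausible reading of the "business scenario adaptability" the abstract mentions.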
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling
