Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
Qinghua Mao, Xi Lin, Jinze Gu, Jun Wu, Siyuan Li, Yuliang Chen

TL;DR
This paper introduces EditRisk-Bench, a benchmark for evaluating the safety risks that malicious knowledge editing of large language models poses to knowledge-intensive reasoning.
Contribution
The benchmark provides a unified framework, covering diverse malicious-editing scenarios and evaluation metrics, for assessing how malicious knowledge editing affects reasoning safety.
Findings
Malicious knowledge editing can induce unsafe reasoning without degrading overall model performance.
Factors like edit scale, knowledge traits, and reasoning complexity significantly influence safety risks.
EditRisk-Bench enables systematic detection and mitigation of safety issues in knowledge editing.
Abstract
Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious…
