TL;DR
StealthGraph is a framework that uses knowledge graphs and obfuscation rewriting to generate domain-specific, implicit harmful prompts for LLM safety testing, with the aim of making safety evaluations more realistic.
Contribution
The paper presents a novel end-to-end method combining knowledge-graph-guided prompt generation and obfuscation rewriting to produce realistic, domain-specific harmful prompts.
Findings
Generated datasets are highly domain-relevant and implicit.
The approach improves the realism of red-teaming for LLM safety.
Code and datasets are publicly available on GitHub.
Abstract
Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts, expressed through indirect domain knowledge, are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies two-strategy obfuscation rewriting to convert…
