A Red Teaming Roadmap Towards System-Level Safety
Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow E. Primack, Julian Michael

TL;DR
This paper advocates for a system-level approach to red teaming in AI safety, emphasizing realistic threat models and safety specifications to better address emerging AI risks.
Contribution
It proposes a prioritized framework for red teaming that focuses on safety specifications, realistic threats, and system-level safety in AI models.
Findings
Red teaming should prioritize testing against clear product safety specifications over abstract social biases or ethical principles.
Prioritize realistic threat models reflecting actual attacker capabilities.
System-level safety is a necessary step for moving red teaming research forward.
Abstract
Large Language Model (LLM) safeguards, which implement request refusals, have become a widely adopted mitigation strategy against misuse. At the intersection of adversarial machine learning and AI safety, safeguard red teaming has effectively identified critical vulnerabilities in state-of-the-art refusal-trained LLMs. However, in our view the many conference submissions on LLM red teaming do not, in aggregate, prioritize the right research problems. First, testing against clear product safety specifications should take a higher priority than abstract social biases or ethical principles. Second, red teaming should prioritize realistic threat models that represent the expanding risk landscape and what real attackers might do. Finally, we contend that system-level safety is a necessary step to move red teaming research forward, as AI models present new threats as well as affordances for…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Ethics and Social Impacts of AI
