Exploring the Secondary Risks of Large Language Models
Jiawei Chen, Zhengwei Fang, Yu Tian, Jiawei Du, Chao Yu, Zhaoxia Yin, Hang Su

TL;DR
This paper identifies a new class of risks in large language models called secondary risks, which involve harmful behaviors during benign interactions, and introduces tools to evaluate and address these risks.
Contribution
The authors define secondary risks, develop SecLens for eliciting these risks, and release SecRiskBench for systematic evaluation of LLM safety.
Findings
Secondary risks are widespread across models.
Secondary risks transfer between different models.
Secondary risks are modality-independent.
Abstract
Ensuring the safety and alignment of Large Language Models is a significant challenge given their growing integration into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors in response to benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we introduce two risk primitives, verbose response and speculative advice, that capture the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task…
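To make the phrase "black-box, multi-objective search" more concrete, here is a minimal toy sketch of the general idea. It is not SecLens: the abstract above is cut off before it names the objectives, so the two objectives used here (a verbosity score and a crude benignness check), the weighted-sum scalarization, the hill-climbing loop, and every function name (`stub_model`, `verbosity_score`, `benign_score`, `scalarize`, `search`) are assumptions made purely for illustration. The "model" is a stub that is only ever queried for its output, which is what makes the search black-box.

```python
import random


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a black-box LLM: longer prompts get longer replies."""
    return " ".join(["word"] * (5 + 3 * len(prompt.split())))


def verbosity_score(response: str, budget: int = 20) -> float:
    """Placeholder objective: how far the reply overshoots a word budget, in [0, 1]."""
    extra = max(0, len(response.split()) - budget)
    return min(1.0, extra / budget)


def benign_score(prompt: str, banned=("ignore", "jailbreak")) -> float:
    """Placeholder objective: 1.0 if the prompt contains no obviously adversarial word."""
    return 0.0 if any(word in prompt.lower() for word in banned) else 1.0


def scalarize(prompt: str, weights=(0.5, 0.5)) -> float:
    """Combine both objectives with a weighted sum (one simple scalarization choice)."""
    response = stub_model(prompt)  # the only access to the model: query and read output
    return weights[0] * verbosity_score(response) + weights[1] * benign_score(prompt)


def search(seed: str, vocab, steps: int = 50, rng=None):
    """Hill-climb over prompts: append a random word, keep it only if the score improves."""
    rng = rng or random.Random(0)
    best, best_score = seed, scalarize(seed)
    for _ in range(steps):
        candidate = best + " " + rng.choice(vocab)
        score = scalarize(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    prompt, score = search("How do I boil an egg", ["please", "quickly", "jailbreak"])
    print(prompt, score)
```

In this toy setup, candidates containing an adversarial-looking word lose the benignness term and are never accepted, so the search drifts toward prompts that stay benign while still drawing out an over-long reply. A real system would replace the stub with API calls to a model and the placeholder scores with whatever objectives the paper actually defines.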
