SecureBreak -- A dataset towards safe and secure models
Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera

TL;DR
SecureBreak is a carefully annotated dataset aimed at improving the detection of unsafe outputs in large language models, enhancing security alignment and robustness against attacks like jailbreaking.
Contribution
The paper introduces SecureBreak, a new dataset for detecting harmful LLM outputs, supporting safety filtering and security enhancement efforts.
Findings
Fine-tuning on SecureBreak improves detection of unsafe content.
The dataset performs well across multiple risk categories.
SecureBreak aids in both post-generation filtering and model alignment.
Abstract
Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an ``ultimate'' defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Information and Cyber Security
