SecureBreak -- A dataset towards safe and secure models

Marco Arazzi; Vignesh Kumar Kembu; Antonino Nocera

arXiv:2603.21975 · cs.CR · March 24, 2026

Open Access

TL;DR

SecureBreak is a carefully annotated dataset aimed at improving the detection of unsafe outputs in large language models, enhancing security alignment and robustness against attacks like jailbreaking.

Contribution

The paper introduces SecureBreak, a new dataset for detecting harmful LLM outputs, supporting safety filtering and security enhancement efforts.

Findings

01. Fine-tuning on SecureBreak improves detection of unsafe content.

02. The dataset performs well across multiple risk categories.

03. SecureBreak aids in both post-generation filtering and model alignment.
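
As a rough illustration of what post-generation filtering means in practice, the sketch below wraps a model response in a check that either passes it through or replaces it with a refusal. Everything named here is hypothetical and not taken from the paper: `make_output_filter`, `toy_unsafe_score`, the 0.5 threshold, and the refusal message are assumptions for illustration. The keyword-based scorer is only a stand-in; in a real deployment that role would presumably be played by a classifier fine-tuned on a dataset such as SecureBreak.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FilterResult:
    allowed: bool   # True if the response passed the check
    text: str       # Text to return to the user (response or refusal)
    score: float    # Unsafe score assigned by the scorer


def make_output_filter(
    unsafe_score: Callable[[str], float],
    threshold: float = 0.5,                    # hypothetical cut-off
    refusal: str = "I can't help with that.",  # hypothetical refusal text
) -> Callable[[str], FilterResult]:
    """Build a post-generation filter around any scorer returning a value in [0, 1]."""

    def apply(response: str) -> FilterResult:
        score = unsafe_score(response)
        if score >= threshold:
            # Block the unsafe response and substitute the refusal message.
            return FilterResult(False, refusal, score)
        return FilterResult(True, response, score)

    return apply


def toy_unsafe_score(text: str) -> float:
    """Placeholder scorer (assumption): flags a couple of fixed phrases.

    A trained unsafe-output classifier would replace this function.
    """
    flagged = ("build a bomb", "steal credentials")
    lowered = text.lower()
    return 1.0 if any(phrase in lowered for phrase in flagged) else 0.0


guard = make_output_filter(toy_unsafe_score)

if __name__ == "__main__":
    for reply in ("Here is a pancake recipe.",
                  "Sure, here is how to steal credentials from a colleague."):
        result = guard(reply)
        print(result.allowed, "|", result.text)
```

Keeping the scorer as a plain callable means the filter logic stays the same whether the check is a keyword list, a small fine-tuned classifier, or a call to a separate moderation model.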

Abstract

Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an "ultimate" defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Information and Cyber Security