Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

Yao-Ching Yu; Tsun-Han Chiang; Cheng-Wei Tsai; Chien-Ming Huang; Wen-Kwang Tsao

arXiv:2502.11191 · cs.CR · October 6, 2025

TL;DR

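This paper introduces Primus, a suite of open-source datasets for training cybersecurity LLMs. The suite covers pretraining, instruction fine-tuning, and reasoning distillation, and models trained on it score higher on public cybersecurity benchmarks and on the CISSP security certification.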

Contribution

Primus provides the first extensive collection of cybersecurity datasets covering all major training stages, enabling better pretraining and fine-tuning of LLMs for cybersecurity.

Findings

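01. 15.9% improvement in the aggregate benchmark score from continual pre-training

02. 15.8% gain in security certification (CISSP) accuracy from reasoning distillation

03. Ablation studies demonstrating the datasets' effectiveness on public cybersecurity benchmarks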

Abstract

Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain in security certification (CISSP). We will release all datasets…

Taxonomy

Topics: Digital and Cyber Forensics