Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao

TL;DR
This paper introduces Primus, a comprehensive set of open-source datasets for training cybersecurity LLMs, significantly improving their performance on benchmarks and security certifications.
Contribution
It provides the first extensive collection of cybersecurity datasets for all training stages, enabling better LLM pretraining and fine-tuning in cybersecurity.
Findings
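Continual pretraining on the Primus data yields a 15.9% improvement in the aggregate benchmark score
Reasoning distillation yields a 15.8% gain on a security certification exam (CISSP)
Ablation studies support the datasets' effectiveness on public cybersecurity benchmarks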
Abstract
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain in security certification (CISSP). We will release all datasets…
Code & Models
- 🤗 trendmicro-ailab/Llama-Primus-Merged (model): 3.5k downloads, 14 likes
- 🤗 trendmicro-ailab/Llama-Primus-Base (model): 26 downloads, 12 likes
- 🤗 trendmicro-ailab/Llama-Primus-Reasoning (model): 34 downloads, 17 likes
- 🤗 krgl/Llama-Primus-Base_8bit-gguf (model): 3 downloads
- 🤗 krgl/Llama-Primus-Reasoning_8bit-gguf (model): 1 download
- 🤗 trend-cybertron/Llama-Primus-Base (model): 3 likes
- 🤗 trend-cybertron/Llama-Primus-Merged (model): 2 likes
- 🤗 trend-cybertron/Llama-Primus-Reasoning (model): 2 likes
- 🤗 trend-cybertron/Llama-Primus-Nemotron-70B-Instruct (model): 5.4k downloads, 14 likes
- 🤗 trend-cybertron/Llama-Primus-Nemotron-70B-Base (model): 6 downloads, 6 likes
Taxonomy
Topics: Digital and Cyber Forensics
