Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens

Salahuddin Salahuddin, Ahmed Hussain, Jussi Löppönen, Toni Jutila, and Panos Papadimitratos

arXiv:2507.02964 · cs.CL · July 8, 2025

TL;DR

This paper presents a resource-efficient domain-adaptive continuous pretraining method that improves the cybersecurity understanding of large language models, achieving state-of-the-art benchmark results with significantly less data and fewer computational resources.

Contribution

The paper introduces a minimal-token continuous pretraining approach for cybersecurity LLMs that balances domain specialization with retention of general knowledge and outperforms models trained on larger datasets.

Findings

01 Achieved state-of-the-art accuracy on cybersecurity benchmarks.

02 Demonstrated effective domain adaptation with only 118.8 million tokens.

03 Validated the approach's computational efficiency and practicality.

Abstract

While Large Language Models (LLMs) demonstrate exceptional natural language capabilities, general-purpose models lack specialized domain knowledge for effective cybersecurity analysis. In this work, we investigate Domain-Adaptive Continuous Pretraining (DAP) as a methodology for enhancing cybersecurity understanding in pretrained LLMs while preserving general language capabilities. We systematically adapted three decoder-based architectures -- Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-14B, and Llama-3.3-70B-Instruct -- using a curated 126-million-word cybersecurity corpus from standards, academic literature, and various other sources. Our approach employed constrained training parameters and distributed FSDP training to balance domain specialization with knowledge preservation. Evaluation across three cybersecurity benchmarks, namely, CTI-MCQ, CyberMetric, and SecEval, demonstrates…
