AttackQA: Development and Adoption of a Dataset for Assisting Cybersecurity Operations using Fine-tuned and Open-Source LLMs
Varun Badrinath Krishna

TL;DR
This paper introduces AttackQA, a large cybersecurity Q&A dataset created with open-source LLMs, and demonstrates how fine-tuning open-source models enhances cybersecurity response accuracy in RAG systems.
Contribution
The paper presents a new cybersecurity Q&A dataset and shows that a RAG system built on fine-tuned open-source LLMs can outperform one built on proprietary models.
Findings
Fine-tuned open-source models outperform GPT-4o in accuracy.
The dataset contains 25,335 Q&A pairs with rationales.
The open-source RAG system delivers both high speed and high accuracy.
Abstract
Retrieval-augmented generation (RAG) on specialized domain datasets has shown improved performance when large language models (LLMs) are fine-tuned for generating responses to user queries. In this study, we develop a cybersecurity question-answering (Q&A) dataset, called AttackQA, and employ it to build a RAG-based Q&A system designed for analysts in security operations centers. The dataset comprises 25,335 Q&A pairs, accompanied by rationales to facilitate fine-tuning and evaluation. 80% of the dataset was generated with the help of a lightweight open-source LLM (Llama 3 8B), which produced over 1100 tokens per second with full 16-bit precision on SambaNova Systems' SN40L specialized hardware. To ensure dataset quality, we fine-tuned Llama 3 70B to detect and reject low-quality Q&A pairs. In using the dataset for RAG, we demonstrate that fine-tuning open-source embeddings and LLMs…
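The pipeline the abstract outlines (Q&A pairs with rationales, a quality filter that rejects weak pairs, and embedding-based retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: `QAPair`, `filter_pairs`, `embed`, `cosine`, and `retrieve` are hypothetical names, the judge is a placeholder for the fine-tuned Llama 3 70B filter, and the bag-of-words embedding is a toy stand-in for a fine-tuned embedding model.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    """One dataset record: a question, its answer, and a supporting rationale."""
    question: str
    answer: str
    rationale: str


def filter_pairs(pairs: List[QAPair], judge: Callable[[QAPair], bool]) -> List[QAPair]:
    """Keep only the pairs the judge accepts.

    Assumption: `judge` stands in for an LLM-based quality classifier.
    """
    return [p for p in pairs if judge(p)]


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: replaces a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]
```

In a full RAG system, the retrieved documents would be placed in the prompt of the answer-generating LLM; here the sketch stops at retrieval, since the generation step depends on the specific model being served.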
Taxonomy
Topics: Software System Performance and Reliability · Big Data and Business Intelligence
Methods: Attention Is All You Need · Linear Layer · Softmax · Dropout · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · WordPiece · Adam
